Skip to main content

Command Palette

Search for a command to run...

Seccomp in Kubernetes

From Pod Spec to Syscall Boundary

Updated
7 min read
Seccomp in Kubernetes

In Part 1, we stayed close to the kernel.

We watched a process call uname(), attach a seccomp filter, and then get shut down at the syscall boundary. No permissions debate. No LSM policy. No capability check. The kernel simply said: that syscall does not exist for you anymore.

Clean. Brutal.

But what about the Kubernetes part? You're probably already running seccomp.

Not because you enabled it. Not because you wrote a profile. And definitely not because you tuned it.

You're running it because your container runtime turned it on for you. When a container starts, the application doesn’t install a seccomp filter. The container runtime does. Docker, containerd, etc. attach a default profile before your code ever runs. Kubernetes doesn’t enforce syscalls. It simply tells the runtime which profile to use. The actual enforcement still happens at the same kernel boundary we saw in Part 1. And once seccomp moves from "toy C demo" to "running cluster," the questions change.

Not the stuff we know:

  • What is a syscall?

  • How does BPF work?

But:

  • What profile is actually active on my pods?

  • What does it allow?

  • And what happens if I turn it off?

That’s where we’re going.


Orientation Diagram

Keep this diagram in mind.

Everything in this post is about how a seccomp profile defined in a Pod spec ends up enforced inside the kernel. Kubernetes selects the profile. The container runtime attaches it. The kernel evaluates every syscall against it.

The enforcement point hasn’t moved. It’s still the syscall boundary we explored in Part 1. What’s changed is the plumbing that decides which filter gets there.


The Most Common Case: Nothing Configured

In many clusters, pods don’t specify a seccomp profile at all.

The pod spec is silent.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: no-seccomp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: no-seccomp
  template:
    metadata:
      labels:
        app: no-seccomp
    spec:
      containers:
        - name: nginx
          image: nginx

It’s simply not present. So what happens? It will run completely unconfined. Create this deployment and check inside the container.

matt@cp:~/seccomp$ kubectl exec no-seccomp-75d54c6445-s7ln8 -- grep Seccomp /proc/1/status
Seccomp:    0
Seccomp_filters:    0

We can see Seccomp: 0, which of course means no seccomp.


A Simple Way to See Seccomp in Action

One syscall commonly blocked by the runtime’s default seccomp profile is keyctl.

keyctl interacts with the Linux kernel keyring subsystem. Most containers don’t need to manage kernel keyrings, so the default profile blocks it as unnecessary attack surface.

If you’re using the basic Nginx image, you can install a small test tool:

matt@cp:~$ kubectl exec -it no-seccomp-75d54c6445-s7ln8 -- /bin/bash
root@no-seccomp-75d54c6445-s7ln8:/# apt-get update && apt-get install -y keyutils

Then run:

root@no-seccomp-75d54c6445-s7ln8:/# keyctl show
Session Keyring
 857262715 --alswrv      0     0  keyring: ...

As we would expect.

Now why this example? The kernel keyring subsystem stores sensitive material such as session keys. Direct interaction with kernel-managed key storage is not something typical application containers need.

From an attacker’s perspective, however, kernel keyrings can become part of privesc and more.

Blocking keyctl removes that entire class of risk (sound familiar?). That’s the core idea behind seccomp: if the workload doesn’t need it, the syscall doesn’t exist.


RuntimeDefault: Making the Baseline Explicit

Instead of relying on cluster defaults (aka nothing), you can declare your intent directly. Let's give it a shot in a new deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: yes-seccomp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yes-seccomp
  template:
    metadata:
      labels:
        app: yes-seccomp
    spec:
      containers:
        - name: nginx
          image: nginx
          imagePullPolicy: IfNotPresent
          securityContext:
            seccompProfile:
              type: RuntimeDefault

This tells Kubernetes to use the default seccomp profile provided by the container runtime. The runtime attaches that profile before the container process starts. The kernel enforces it on every syscall. Just a filter at the syscall boundary.

Let's run the same exercise as before.

matt@cp:~$ kubectl exec -it yes-seccomp-75d54c6445-s7ln8 -- /bin/bash
root@no-seccomp-75d54c6445-s7ln8:/# apt-get update && apt-get install -y keyutils

Then run:

root@no-seccomp-75d54c6445-s7ln8:/# keyctl show
Session Keyring
Unable to dump key: Operation not permitted

Voila, syscall filtering at is finest. But what exactly is that filter?


The Actual RuntimeDefault Profile

If you want to see what RuntimeDefault really means on your node (and why wouldn't you?), inspect the OCI runtime spec the container runtime handed to runc.

First, list running containers and grab a container ID:

matt@cp:~/seccomp$ sudo crictl ps | grep yes-seccomp
...
d7ebc42cadd5f       2af158aaca82b       14 minutes ago      Running             nginx                       0                   e3aa4f4782108       yes-seccomp-796856b464-hqq44               default

Now inspect the container and extract the exact seccomp configuration from the runtime spec:

matt@cp:~/seccomp$ sudo crictl inspect d7ebc42cadd5f | jq '.info.runtimeSpec.linux.seccomp'

If seccomp is enabled, you’ll see something like:

{
  "architectures": [
    "SCMP_ARCH_ARM",
    "SCMP_ARCH_AARCH64"
  ],
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "action": "SCMP_ACT_ALLOW",
      "names": [
        "accept",
        "accept4",
...
          "op": "SCMP_CMP_MASKED_EQ",
          "value": 2114060288
        }
      ],
      "names": [
        "clone"
      ]
    },
    {
      "action": "SCMP_ACT_ERRNO",
      "errnoRet": 38,
      "names": [
        "clone3"
      ]
    }
  ]
}

This is the actual seccomp profile applied to the container. And it is easy to see what is explicitly allowed by looking at the names of syscalls under SCMP_ACT_ALLOW.

And if you actually want to see how many syscalls are allowed, it is easy to take a look.

matt@cp:~/seccomp$ sudo crictl inspect d7ebc42cadd5f | jq '
  .info.runtimeSpec.linux.seccomp.syscalls
  | map(.names) | add
  | unique
  | length
'
...
377

Unconfined and Privileged

RuntimeDefault is a baseline. But it’s not guaranteed. In Kubernetes, seccomp can be explicitly disabled:

securityContext:
  seccompProfile:
    type: Unconfined

That tells the runtime not to attach a seccomp filter at all. When that happens, the container process runs with full access to the kernel’s syscall surface (subject to capabilities and LSMs, but without syscall filtering).

Adjust the deployment with this new securityContext and you can verify it the same way as before:

matt@cp:~/seccomp$ kubectl exec unconfined-seccomp-6489c66986-vslvv -- grep Seccomp /proc/1/status
Seccomp:    0
Seccomp_filters:    0

There’s another common way seccomp effectively disappears: privileged containers. When a container runs privileged, it is granted elevated access to the host.

securityContext:
  privileged: true
  seccompProfile:
    type: RuntimeDefault

Adjust the deployment with this new securityContext and you can verify it the same way as before:

matt@cp:~/seccomp$ kubectl exec priv-seccomp-57778f5d79-zdm9x -- grep Seccomp /proc/1/status
Seccomp:    0
Seccomp_filters:    0

As you can see, privileged wiped out the intent of the seccompProfile. User beware.


Making Seccomp the Default (kubeadm)

Declaring RuntimeDefault in every pod spec works. But there's another way! You can read it in the CIS Kubernetes Benchmark under 4.2.14.

Modern Kubernetes supports making the runtime’s default seccomp profile the automatic baseline for all pods that don’t explicitly specify one. On kubeadm clusters, this is controlled by the kubelet.

You want:

seccompDefault: true

You can verify whether it’s enabled on a node:

kubectl proxy &
curl -s http://127.0.0.1:8001/api/v1/nodes/<node-name>/proxy/configz   | jq '.kubeletconfig.seccompDefault'

If it returns true, pods without an explicit seccompProfile will automatically run under the runtime’s default profile.

If it’s false, a pod that doesn’t declare seccomp may run completely unconfined.

To enable it in kubeadm, update your KubeletConfiguration, which should be in var/lib/kubelet/config.yaml. Simply add the default setting.

seccompDefault: true

Then apply the configuration and restart the kubelet. With seccompDefault: true, RuntimeDefault becomes the cluster-wide baseline instead of an opt-in setting. Not too bad.


Wrap Up

It’s tempting to treat RuntimeDefault as some kind of security panacea, but of course it isn't.

The runtime’s default seccomp profile is designed to be broadly compatible. It allows hundreds of syscalls because most applications need them.

What it blocks are the obvious outliers like kernel keyring manipulation (bet you didn't think I would go there). That’s valuable. It reduces attack surface. But it does not create tight isolation.

If your application only needs 80 syscalls, and the runtime allows 377, you’re still exposing far more kernel surface than strictly necessary.

RuntimeDefault is a baseline hygiene control.

It says:

“We’re not going to allow clearly dangerous or unnecessary syscalls.”

It does not say:

“This workload has a minimal, workload-specific syscall surface.”

For many teams, RuntimeDefault is the right tradeoff. It’s low friction, broadly safe, and rarely breaks applications. But it’s not a sandbox. It’s a compatibility-first safety net.