Linux Capabilities: A Beginner's Overview

Moving from Root Dominance to Precise Capability Allocation

Working as a solutions architect while going deep on Kubernetes security — prevention-first thinking, open source tooling, and a daily rabbit hole of hands-on learning. I make the mistakes, then figure out how to fix them (eventually).

Over the past few months, I’ve been testing root and non-root containers. Naturally, that led me deep into the Kubernetes securityContext options for both pods and containers. There’s a lot packed into that field, and a handful of particularly interesting knobs stand out:

  • Linux Security Modules (LSMs)
  • Capabilities
  • Seccomp

But here’s the thing: all of these are just Linux. Kubernetes is merely surfacing functionality the kernel has had for decades.

I’ve already gone down the LSM rabbit hole. In that post, I focused on AppArmor, and I’m resisting the urge to dive into SELinux or SMACK. So let’s call that part of the journey complete for now.

Now it’s time to explore Linux capabilities. They’re surprisingly simple once you understand the model. You can dive deep in the man page. It starts like this:

“Traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero).
Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials.”

So where do capabilities come into play?

“Starting with Linux 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities.”

And you can see the difference right in the kernel source.
In Linux 2.0.40, there’s no mention of capabilities at all — just hardcoded UID checks like if (!suser()) return -EPERM;. But by Linux 2.2.22, fs/open.c has been rewritten to use checks like !capable(CAP_SYS_CHROOT), showing the shift to fine-grained privilege control.

So here's the reality. In modern Linux, we shouldn’t limit risk management to root alone, because privilege isn’t binary anymore. It’s been split into a collection of discrete powers, each represented by a capability. And if you don’t understand which ones your process holds, you might not fully understand what it can actually do.

For example:

  • CAP_NET_RAW: allow raw sockets
  • CAP_SYS_ADMIN: a broad catch-all for administrative operations, including mount()
  • CAP_SYS_PTRACE: ptrace() of any process
  • CAP_DAC_OVERRIDE: override all DAC access

Security should involve an understanding of what your workloads are already allowed to do, and what the kernel will or won’t stop. Let’s talk about Linux capabilities.


What Are Linux Capabilities, Really?

Before Linux had capabilities, privilege was binary: either you were root (UID 0) and could do everything, or you weren’t and got EPERM. Every privileged syscall like mount(), ptrace(), or chown() used a hardcoded superuser check. Capabilities replace that single check with one check per privilege. They don’t prevent a process from making a syscall, but they do decide whether the kernel allows it to succeed.

Deep Dive

In pre-capabilities Linux (like 2.0.40), the logic behind privileged syscalls like setgid() was straightforward: a single suser() call decided whether the caller got the privileged path. Here is sys_setgid() from that era:

asmlinkage int sys_setgid(gid_t gid)
{
    int old_egid = current->egid;

    if (suser())
        current->gid = current->egid = current->sgid = current->fsgid = gid;
    else if ((gid == current->gid) || (gid == current->sgid))
        current->egid = current->fsgid = gid;
    else
        return -EPERM;
    if (current->egid != old_egid)
        current->dumpable = 0;
    return 0;
}

So if you were root, you could change your group IDs freely. If not, you could only change your effective group ID to match your real or saved group. Otherwise? You got a good old-fashioned -EPERM.

Fast forward to 2.2.24 and the function is just as simple. The only change is that the superuser check gives way to a capability check:

asmlinkage int sys_setgid(gid_t gid)
{
    int old_egid = current->egid;

    if (capable(CAP_SETGID))
        current->gid = current->egid = current->sgid = current->fsgid = gid;
    else if ((gid == current->gid) || (gid == current->sgid))
        current->egid = current->fsgid = gid;
    else
        return -EPERM;

    if (current->egid != old_egid)
        current->dumpable = 0;
    return 0;
}

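Fast forward to current kernels and the same function is more involved (credentials live in their own structure and the capability check is user-namespace aware), so we'll skip it here.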
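To make the difference concrete, here is a small Python model of the 2.2-era sys_setgid() shown above. It is only a sketch: the Task class and the capable() helper are simplified stand-ins invented for illustration, not kernel interfaces. The one real constant is CAP_SETGID = 6, taken from the capability header.

```python
import errno
from dataclasses import dataclass

CAP_SETGID = 6  # capability number from the kernel header


@dataclass
class Task:
    """Hypothetical stand-in for the kernel's per-process state."""
    gid: int
    egid: int
    sgid: int
    fsgid: int
    effective: int = 0      # effective capability bitmask
    dumpable: bool = True


def capable(task: Task, cap: int) -> bool:
    """True if the capability's bit is set in the effective mask."""
    return bool(task.effective & (1 << cap))


def sys_setgid(task: Task, gid: int) -> int:
    """Mirror of the 2.2-era logic: capability path, fallback path, or EPERM."""
    old_egid = task.egid
    if capable(task, CAP_SETGID):
        task.gid = task.egid = task.sgid = task.fsgid = gid
    elif gid in (task.gid, task.sgid):
        task.egid = task.fsgid = gid
    else:
        return -errno.EPERM
    if task.egid != old_egid:
        task.dumpable = False
    return 0


unprivileged = Task(gid=1000, egid=1000, sgid=1000, fsgid=1000)
privileged = Task(gid=1000, egid=1000, sgid=1000, fsgid=1000,
                  effective=1 << CAP_SETGID)

print(sys_setgid(unprivileged, 0))  # -1 (that is, -EPERM)
print(sys_setgid(privileged, 0))    # 0
```

Notice that the privileged task is not root in any UID sense. It simply has one bit set, and that is all the check cares about.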

The Root Problem

The old model had no concept of “just enough privilege.” If you wanted to let a program:

  • Bind to a port below 1024 (bind())
  • Run ping (requires raw socket access)
  • Trace another process (ptrace())

You had to give it full root, which meant total control over the system. Not ideal.

Available Capabilities

There are currently 41 capabilities in Linux, listed in the capabilities man page and in the kernel's capability header file.

Example from the header file:

/* Allow ioperm/iopl access */
/* Allow sending USB messages to any device via /dev/bus/usb */

#define CAP_SYS_RAWIO        17

/* Allow use of chroot() */

#define CAP_SYS_CHROOT       18

/* Allow ptrace() of any process */

#define CAP_SYS_PTRACE       19

These header definitions are used throughout the kernel wherever permission checks are needed. You’ll see them pop up in everything from mount() to chroot().
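Each number in the header is a bit position in the capability masks the kernel keeps for a process. A minimal sketch of that relationship, using only the three values quoted above; cap_to_mask and has_cap are hypothetical helper names for illustration, not kernel or library functions:

```python
# Capability numbers copied from the header excerpt above.
CAP_SYS_RAWIO = 17
CAP_SYS_CHROOT = 18
CAP_SYS_PTRACE = 19


def cap_to_mask(cap: int) -> int:
    """Return a bitmask with only the given capability's bit set."""
    return 1 << cap


def has_cap(mask: int, cap: int) -> bool:
    """True if capability number `cap` is set in `mask`."""
    return bool(mask & cap_to_mask(cap))


# A mask holding CAP_SYS_CHROOT and CAP_SYS_PTRACE, but not CAP_SYS_RAWIO.
example_mask = cap_to_mask(CAP_SYS_CHROOT) | cap_to_mask(CAP_SYS_PTRACE)

print(hex(example_mask))                      # 0xc0000
print(has_cap(example_mask, CAP_SYS_CHROOT))  # True
print(has_cap(example_mask, CAP_SYS_RAWIO))   # False
```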

More Than Just On or Off

Capabilities aren’t just a binary “has it / doesn’t have it.” When a process runs, the kernel keeps several capability sets that define not only what’s possible, but what’s actually in effect. Here is a high-level overview of each set.

Set          What It Controls
Permitted    The menu of capabilities the process may make effective or inheritable.
Effective    The subset currently in use: what the kernel actually checks when the process attempts a privileged operation.
Inheritable  Capabilities preserved across execve(); they only become permitted in the new program if the executed file's inheritable set also includes them.
Bounding     The hard ceiling: a capability missing from this set cannot be gained through execve(), even by root.
Ambient      Lets certain capabilities stick around across execve() for non-root processes running ordinary binaries (no set-UID bit or file capabilities).
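How the five sets interact is easiest to see at execve() time. The sketch below encodes the transformation rules given in the capabilities man page for a non-root process. It is a simplified model under stated assumptions: the ProcCaps and FileCaps classes are hypothetical, and it ignores the special handling of root and the securebits flags.

```python
from dataclasses import dataclass

EMPTY = frozenset()


@dataclass(frozen=True)
class ProcCaps:
    """The five per-process capability sets."""
    permitted: frozenset = EMPTY
    effective: frozenset = EMPTY
    inheritable: frozenset = EMPTY
    bounding: frozenset = EMPTY
    ambient: frozenset = EMPTY


@dataclass(frozen=True)
class FileCaps:
    """Capabilities attached to the executable being run."""
    permitted: frozenset = EMPTY
    inheritable: frozenset = EMPTY
    effective_bit: bool = False
    setid: bool = False  # set-user-ID or set-group-ID bit present


def after_execve(p: ProcCaps, f: FileCaps) -> ProcCaps:
    """Apply the execve() rules for a non-root process (simplified)."""
    # Ambient capabilities are cleared when the file itself carries privilege.
    privileged_file = f.setid or bool(f.permitted or f.inheritable or f.effective_bit)
    ambient = EMPTY if privileged_file else p.ambient
    permitted = (p.inheritable & f.inheritable) | (f.permitted & p.bounding) | ambient
    effective = permitted if f.effective_bit else ambient
    # Inheritable and bounding sets pass through unchanged.
    return ProcCaps(permitted, effective, p.inheritable, p.bounding, ambient)


nb = frozenset({"cap_net_bind_service"})
plain_binary = FileCaps()  # ordinary executable, no file capabilities

with_ambient = ProcCaps(permitted=nb, effective=nb, inheritable=nb, bounding=nb, ambient=nb)
without_ambient = ProcCaps(permitted=nb, effective=nb, inheritable=nb, bounding=nb)

print(sorted(after_execve(with_ambient, plain_binary).effective))     # ['cap_net_bind_service']
print(sorted(after_execve(without_ambient, plain_binary).effective))  # []
```

The second result is the reason ambient capabilities exist: without them, a non-root process loses its capabilities the moment it executes an ordinary binary.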

You can check a running process’s capabilities with:

grep Cap /proc/$$/status

For a friendlier view (with names instead of hex):

capsh --print
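If capsh isn't available, the hex masks from /proc can be decoded by hand, since each bit position is a capability number. A minimal sketch, assuming the 41 capabilities numbered 0 to 40 in the kernel header; decode_mask and parse_status are hypothetical helper names, and SAMPLE_STATUS is an illustrative input rather than captured output:

```python
# Capability names in kernel numbering order (index == capability number).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_mask(hex_mask: str) -> list:
    """Turn a hex capability mask into a list of capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def parse_status(text: str) -> dict:
    """Decode every Cap* line found in /proc/<pid>/status style text."""
    decoded = {}
    for line in text.splitlines():
        if line.startswith("Cap"):
            key, _, value = line.partition(":")
            decoded[key] = decode_mask(value.strip())
    return decoded


# Illustrative sample in the format printed by: grep Cap /proc/$$/status
SAMPLE_STATUS = """\
CapInh:\t0000000000000000
CapPrm:\t00000000a80425fb
CapEff:\t00000000a80425fb
CapBnd:\t00000000a80425fb
CapAmb:\t0000000000000000
"""

caps = parse_status(SAMPLE_STATUS)
print(len(caps["CapEff"]))  # 14
print(caps["CapAmb"])       # []
```

To decode a live process instead of the sample, pass the contents of /proc/self/status (or any other PID's status file) to parse_status().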

Practical Example with CAP_NET_BIND_SERVICE

Let's walk through our capabilities with low port bindings. We'll try to set a service to bind to a port below 1024. Let's start with a simple C app that we'll call bind_low_port. This app will simply try to bind, not run anything persistent. Here is the app:

// bind_low_port.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main() {
    int sockfd;
    struct sockaddr_in addr;

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd == -1) {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(80); // Privileged port (<1024)

    if (bind(sockfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    printf("Successfully bound to port 80\n");
    close(sockfd);
    return 0;
}

Go ahead and compile that and drop it into /usr/local/bin:

sudo gcc -o /usr/local/bin/bind_low_port bind_low_port.c
sudo chmod +x /usr/local/bin/bind_low_port

Now the app is ready to go!

Run as Root

Now let's rewind to the pre-capabilities days. We need to make sure we are root to get this to work so let's create a service to run as root! Save the following as bind-root.service.

[Unit]
Description=Bind to port 80 as root

[Service]
ExecStart=/usr/local/bin/bind_low_port
User=root

[Install]
WantedBy=multi-user.target

Copy it into our systemd directory, reload systemd, start it, and check status:

sudo cp bind-root.service /etc/systemd/system/
sudo systemctl daemon-reexec
sudo systemctl daemon-reload
sudo systemctl start bind-root.service
sudo systemctl status bind-root

And you should see:

○ bind-root.service - Bind to port 80 as root
     Loaded: loaded (/etc/systemd/system/bind-root.service; disabled; preset: enabled)
     Active: inactive (dead)

Aug 08 20:32:52 controlplane systemd[1]: Started bind-root.service - Bind to port 80 as root.
Aug 08 20:32:52 controlplane bind_low_port[2400223]: Successfully bound to port 80
Aug 08 20:32:52 controlplane systemd[1]: bind-root.service: Deactivated successfully.

Worked as expected, but not a desirable permission level.

Run with Net Bind Capability

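Now let's get to the capabilities world. We no longer need root! Save the following as bind-captest.service. User and Group run the service as an unprivileged account, AmbientCapabilities hands that account CAP_NET_BIND_SERVICE, CapabilityBoundingSet limits the service to that one capability, and NoNewPrivileges stops it from gaining more through execve().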

[Unit]
Description=Bind to port 80 with CAP_NET_BIND_SERVICE

[Service]
ExecStart=/usr/local/bin/bind_low_port
User=captest
Group=captest
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target

We'll need to create this non-privileged user and group.

sudo groupadd --system captest
sudo useradd --system --no-create-home --gid captest captest

Copy it into your systemd directory, reload systemd, start it, and check status:

sudo cp bind-captest.service /etc/systemd/system/
sudo systemctl daemon-reexec
sudo systemctl daemon-reload
sudo systemctl start bind-captest.service
sudo systemctl status bind-captest

And you should see:

○ bind-captest.service - Bind to port 80 with CAP_NET_BIND_SERVICE
     Loaded: loaded (/etc/systemd/system/bind-captest.service; disabled; preset: enabled)
     Active: inactive (dead)

Aug 08 20:21:13 controlplane systemd[1]: bind-captest.service: Failed with result 'exit-code'.
Aug 08 20:22:13 controlplane systemd[1]: Started bind-captest.service - Bind to port 80 with CAP_NET_BIND_SERVICE.
Aug 08 20:22:13 controlplane bind_low_port[2392163]: Successfully bound to port 80
Aug 08 20:22:13 controlplane systemd[1]: bind-captest.service: Deactivated successfully

Worked as expected and with least privilege. Not too bad.

Run with No Root and No Capability

Again we'll create a service. Save the following as bind-nonroot.service.

[Unit]
Description=Bind to port 80 with no capabilities

[Service]
ExecStart=/usr/local/bin/bind_low_port
User=captest
Group=captest
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target

No need to create the user and group as we did that in the previous step. Copy it into our systemd directory, reload systemd, start it, and check status:

sudo cp bind-nonroot.service /etc/systemd/system/
sudo systemctl daemon-reexec
sudo systemctl daemon-reload
sudo systemctl start bind-nonroot.service
sudo systemctl status bind-nonroot

And you should see:

× bind-nonroot.service - Bind to port 80 with no capabilities
     Loaded: loaded (/etc/systemd/system/bind-nonroot.service; disabled; preset: enabled)
     Active: failed (Result: exit-code) since Fri 2025-08-08 20:28:15 UTC; 11min ago
   Duration: 4ms
   Main PID: 2396701 (code=exited, status=1/FAILURE)
        CPU: 1ms

Aug 08 20:28:15 controlplane systemd[1]: Started bind-nonroot.service - Bind to port 80 with no capabilities.
Aug 08 20:28:15 controlplane bind_low_port[2396701]: bind: Permission denied
Aug 08 20:28:15 controlplane systemd[1]: bind-nonroot.service: Main process exited, code=exited, status=1/FAILURE
Aug 08 20:28:15 controlplane systemd[1]: bind-nonroot.service: Failed with result 'exit-code'.

Failed just as expected.

So we hit the trifecta. No bonus payout, sadly. Here is a quick recap:

Service Type             Can Bind Port 80?  Privilege Level                  Notes
Root                     Yes                Full root                        Risky: has all powers
Non-root + Capability    Yes                Scoped via CAP_NET_BIND_SERVICE  Just enough privilege
Non-root, No Capability  No                 No elevated privileges           Expected failure (Permission denied)
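The three outcomes in the recap follow from one rule, which can be sketched in a few lines. This is a simplified model, not the kernel's code: can_bind is a hypothetical function, root is represented by a small stand-in set that happens to include the capability, and the threshold of 1024 is assumed to be the default value of the net.ipv4.ip_unprivileged_port_start sysctl.

```python
CAP_NET_BIND_SERVICE = "cap_net_bind_service"


def can_bind(port: int, effective_caps: set, unprivileged_port_start: int = 1024) -> bool:
    """Predict whether bind() to a TCP/UDP port passes the privilege check."""
    # Port 0 (kernel picks a port) and ports at or above the threshold
    # need no capability at all.
    if port == 0 or port >= unprivileged_port_start:
        return True
    # Below the threshold, only the capability matters, not the UID.
    return CAP_NET_BIND_SERVICE in effective_caps


scenarios = {
    "root": {"cap_net_bind_service", "cap_sys_admin", "cap_chown"},  # stand-in for "everything"
    "non-root + capability": {"cap_net_bind_service"},
    "non-root, no capability": set(),
}

for name, caps in scenarios.items():
    print(name, can_bind(80, caps))
# root True
# non-root + capability True
# non-root, no capability False
```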

Now that we've laid the groundwork, let's dive into something near and dear.


Capabilities in Container Land

Docker Containers

By default, Docker containers run with a reduced, but not empty, set of capabilities. Let's create a simple Dockerfile that gives us some tools.

FROM ubuntu:22.04
RUN apt update && apt install -y \
    libcap2-bin \
    strace \
    util-linux \
    iproute2 \
    procps \
    net-tools \
    python3 \
    curl \
    && apt clean

Then create and run in interactive mode:

docker build -t captest .
docker run --rm -it captest

Now let's check our capabilities.

root@12801a8bb3a8:/# capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Current IAB: !cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_lease,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=0(root)
Guessed mode: UNCERTAIN (0)

You’ll see things like CAP_NET_BIND_SERVICE, CAP_CHOWN, and CAP_DAC_OVERRIDE still available.

Test something allowed such as CAP_CHOWN:

root@12801a8bb3a8:/# touch testfile && chown nobody:nogroup testfile && ls
bin  boot  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  secret  srv  sys  testfile  tmp  usr  var

Test something not allowed such as CAP_NET_ADMIN:

root@12801a8bb3a8:/# ip link add dummy0 type dummy
RTNETLINK answers: Operation not permitted

Exit the container and we can try to drop and add a capability.

docker run --rm -it --cap-drop=CHOWN --cap-add=CAP_NET_ADMIN captest

And now the capabilities have switched:

root@2d6de2b45347:/# touch file
root@2d6de2b45347:/# chown nobody file
chown: changing ownership of 'file': Operation not permitted
root@54e336fcd322:/# ip link add dummy0 type dummy
root@54e336fcd322:/#

That all worked as expected. We've seen the defaults and shown how to add and drop capabilities.
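The resulting set can be worked out on paper before starting a container. The sketch below is a simplified model rather than Docker's actual implementation: DOCKER_DEFAULTS is copied from the capsh output above, normalize and tweak are hypothetical helpers, names are accepted with or without the CAP_ prefix (the command above mixes both spellings), and letting add win over drop when a name appears in both lists is an assumption.

```python
# Default capabilities, copied from the "Current:" line of capsh --print above.
DOCKER_DEFAULTS = frozenset({
    "cap_chown", "cap_dac_override", "cap_fowner", "cap_fsetid", "cap_kill",
    "cap_setgid", "cap_setuid", "cap_setpcap", "cap_net_bind_service",
    "cap_net_raw", "cap_sys_chroot", "cap_mknod", "cap_audit_write",
    "cap_setfcap",
})


def normalize(name: str) -> str:
    """Accept CHOWN, chown, or CAP_CHOWN and return cap_chown."""
    name = name.strip().lower()
    return name if name.startswith("cap_") else "cap_" + name


def tweak(defaults, add=(), drop=()) -> set:
    """Remove the dropped capabilities from the defaults, then add the extras."""
    dropped = {normalize(n) for n in drop}
    added = {normalize(n) for n in add}
    return (set(defaults) - dropped) | added


result = tweak(DOCKER_DEFAULTS, add=["CAP_NET_ADMIN"], drop=["CHOWN"])
print("cap_chown" in result)      # False
print("cap_net_admin" in result)  # True
print(len(result))                # 14
```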

Kubernetes Pods

Let's try the same thing in Kubernetes. Create the following pod definition and save it as captest-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: cap-default
spec:
  containers:
  - name: cap-default
    image: ubuntu
    command: ["sleep", "infinity"]

Deploy the pod and exec in:

kubectl apply -f captest-pod.yaml
kubectl exec -it cap-default -- /bin/bash

Install the relevant tools and check the capabilities. They will be as expected from our Docker experiment:

root@cap-default:/# apt update && apt install -y \
    libcap2-bin \
    strace \
    util-linux \
    iproute2 \
    procps \
    net-tools \
    python3 \
    curl \
    && apt clean
root@cap-default:/# capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Current IAB: !cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_lease,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0 (no-new-privs=0)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=0(root)
Guessed mode: HYBRID (4)

Now let's change the pod definition to add and drop capabilities. Here is our new definition, saved as cap-custom.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: cap-custom
spec:
  containers:
  - name: cap-custom
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        drop:
        - CHOWN
        add:
        - NET_ADMIN

Deploy the pod and exec in:

kubectl apply -f cap-custom.yaml
kubectl exec -it cap-custom -- /bin/bash

Install the relevant tools and check the capabilities. They will be as expected from our Docker experiment (you might see a lot of errors due to dropping CAP_CHOWN):

root@cap-custom:/# apt update && apt install -y \
    libcap2-bin \
    strace \
    util-linux \
    iproute2 \
    procps \
    net-tools \
    python3 \
    curl \
    && apt clean
root@cap-custom:/# capsh --print
Current: cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Bounding set =cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Current IAB: !cap_chown,!cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_lease,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=0(root)
Guessed mode: UNCERTAIN (0)

Our quick tests showed the same rule applies whether you’re running a straight Docker container or running in Kubernetes:

  • In Docker, a container running as root starts with a limited set of default capabilities. We proved that if a capability like CAP_NET_ADMIN isn’t present, privileged operations (e.g., ip link add) fail with Operation not permitted.
  • In Kubernetes, the securityContext gives you fine-grained control. You can:
    • Drop capabilities you don’t want (e.g., CAP_CHOWN) to shrink the attack surface, even for root.
    • Add specific capabilities (e.g., CAP_NET_ADMIN) to grant only what’s needed without giving the Pod full root privileges.

Capabilities are your surgical tool for granting just enough privilege, whether that’s binding to a low port, tweaking networking, or denying privileged operations a workload doesn’t need.


Final Thoughts: Why Capabilities Matter

In the early days of Linux, privilege was an all-or-nothing deal: UID 0 could do everything, and everyone else got EPERM. That simplicity came at a cost: giving a process one privileged action meant giving it all of them.

Linux 2.2’s introduction of capabilities changed that. Privilege could now be split into fine-grained units like CAP_NET_BIND_SERVICE or CAP_SYS_PTRACE, each tied to specific privileged operations. Our CAP_NET_BIND_SERVICE example showed how this plays out in practice:

  • Root-only service: Works, but comes with every privilege.
  • Capability-only service: Works for the intended action (binding port 80) without extra power.
  • No capability: Fails as expected.

When we moved into containers, the same principle held: Docker and Kubernetes both start with a reduced set of capabilities for root, and both let you add or drop individual privileges. Our tests showed:

  • Dropping CAP_CHOWN removes the ability to change file ownership, even for root.
  • Without CAP_NET_ADMIN, network device management fails with Operation not permitted.
  • Adding just the required capability restores the intended function without re-granting full root.

The takeaway: capabilities are one of the cleanest ways to shrink your attack surface without breaking legit workloads. I think of them as a firewall for privileged operations. You can allow only what you need, block everything else, and never give away more than necessary.