Linux capabilities, and how they interact with users and containers, can confuse even experienced engineers. At first it seems like this is fairly straightforward stuff, but it gets complex quickly and the information on exactly what is going on is scattered across many pages, Git repos and blogs, so it can be hard to piece together.
Things got complicated enough that I've split this blog into two pieces:
1) Capabilities: Why They Exist and How They Work
The logic and rules in 1) get more than confusing, enough so that 2) is required to properly understand what is going on. If you want to fully grasp all of this, I'd suggest reading 1) then 2) before going back to 1). And if that works you can then do 3): come and explain all of it back to me, so that I might finally understand.
Before capabilities, we only had the binary system of privileged and non-privileged processes; either your process could do everything—make admin-level kernel calls—or it was restricted to the subset of a standard user. Certain executables, which needed to be run by standard users but also make privileged kernel calls, would have the suid bit set, effectively granting them privileged access. (The typical example is ping, which was traditionally given fully privileged access to make ICMP calls.)
These executables are prime targets for hackers—if they can exploit a bug in them, they can escalate their privilege levels on the system.
This wasn't a great situation, so the kernel developers came up with a more nuanced solution: capabilities.
The idea is simple: split all the possible privileged kernel calls into groups of related functionality, then grant each process only the subset it needs. So the kernel calls were split up into a few dozen different categories, largely successfully.
Going back to the ping example, it can be given only the single CAP_NET_RAW capability, significantly decreasing the security risk. (Justin Cormack pointed out that ping doesn't actually need any capabilities, since ping sockets were added to the kernel, but it's gated by a config setting that is disabled by most Linux distributions. I've decided not to travel down the rabbit hole of why...)
Now, all this is still straightforward enough. Where the pain starts is how processes are granted privileges via files and users. The capabilities(7) man page is about the best resource here, but it is pretty terse. First, a few background things that are important to understand:
Now for the complicated bit. In order to be able to assign capabilities to threads, we have the idea of ‘capability sets’. There are five sets for processes, two of which can also be applied to files.
The effective set is the set that the kernel checks to allow or disallow calls. The other four sets (inheritable, permitted, ambient, and bounding) control which capabilities get added to or removed from the effective set. Executables also get two of these (permitted and inheritable), as well as an effective bit which can be set.
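To make the five sets a little more concrete, here is a minimal Python sketch that decodes the Cap* lines the kernel exposes in /proc/<pid>/status. The sample text, the helper names and the deliberately short name table are my own assumptions for illustration; only the field names (CapInh, CapPrm, CapEff, CapBnd, CapAmb) and the capability numbers come from the kernel.

```python
# Sketch: decode the Cap* bitmasks from /proc/<pid>/status into names.
# CAP_NAMES is intentionally incomplete; the numbers match linux/capability.h.
CAP_NAMES = {
    0: "CAP_CHOWN",
    7: "CAP_SETUID",
    8: "CAP_SETPCAP",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
}

# Field name in the status file -> the set it describes.
FIELDS = {
    "CapInh": "inheritable",
    "CapPrm": "permitted",
    "CapEff": "effective",
    "CapBnd": "bounding",
    "CapAmb": "ambient",
}


def decode_mask(mask):
    """Turn a capability bitmask into a list of names, lowest bit first."""
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Fall back to a generic label for numbers not in the short table.
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        mask >>= 1
        bit += 1
    return names


def parse_cap_sets(status_text):
    """Extract the five capability sets from the text of a status file."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in FIELDS:
            sets[FIELDS[key]] = decode_mask(int(value.strip(), 16))
    return sets


# Hypothetical status excerpt for a process holding only CAP_NET_RAW.
SAMPLE = (
    "Name:\tping\n"
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000000000002000\n"
    "CapEff:\t0000000000002000\n"
    "CapBnd:\t000001ffffffffff\n"
    "CapAmb:\t0000000000000000\n"
)

caps = parse_cap_sets(SAMPLE)
print(caps["permitted"], caps["effective"], caps["ambient"])
```

On a Linux machine, passing the contents of /proc/self/status to parse_cap_sets should show the sets of the current process in the same way.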
The easiest way to explain these sets is to refer to the logic that gets applied to assign capabilities to the new process on an execve call. The following is taken verbatim from the capabilities man page:
P'(ambient)     = (file is privileged) ? 0 : P(ambient)
P'(permitted)   = (P(inheritable) & F(inheritable)) |
                  (F(permitted) & cap_bset) | P'(ambient)
P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)
P'(inheritable) = P(inheritable) [i.e., unchanged]

where:
P denotes the value of a thread capability set before the execve(2)
P' denotes the value of a thread capability set after the execve(2)
F denotes a file capability set
cap_bset is the value of the capability bounding set.
Let’s start with what this means for our ping example. If we put CAP_NET_RAW into the permitted set for the ping binary (F(permitted) above), it will be added to the permitted set for the process (P'(permitted)). As the ping binary is ‘capabilities aware’, it will then make a call to add the CAP_NET_RAW into the effective set.
Alternatively, if the binary hadn’t been ‘capabilities aware’, we could have set the effective bit (F(effective) above), which would have automatically added the capability into the effective set. A ‘capabilities aware’ binary is more secure, as it’s possible to limit the amount of time for which the process acquires the capability.
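The execve rules and the two flavours of the ping example can be modelled in a few lines of Python. This is only a sketch of the formulas quoted from the man page, not kernel code: the class and function names are made up for illustration, capabilities are plain strings, and the special handling of root (uid 0) is deliberately ignored.

```python
from dataclasses import dataclass

NONE = frozenset()


@dataclass(frozen=True)
class ThreadCaps:
    """Capability sets of a thread (hypothetical model, not a kernel API)."""
    permitted: frozenset = NONE
    inheritable: frozenset = NONE
    effective: frozenset = NONE
    ambient: frozenset = NONE
    bounding: frozenset = NONE


@dataclass(frozen=True)
class FileCaps:
    """Capability attributes of an executable file."""
    permitted: frozenset = NONE
    inheritable: frozenset = NONE
    effective_bit: bool = False
    setuid_or_setgid: bool = False

    @property
    def privileged(self):
        # A file counts as privileged if it has capabilities or a
        # set-user-ID / set-group-ID bit.
        return bool(self.permitted or self.inheritable
                    or self.effective_bit or self.setuid_or_setgid)


def caps_after_execve(p, f):
    """Apply the execve(2) transformation rules to thread p and file f."""
    ambient = NONE if f.privileged else p.ambient
    permitted = ((p.inheritable & f.inheritable)
                 | (f.permitted & p.bounding)
                 | ambient)
    effective = permitted if f.effective_bit else ambient
    return ThreadCaps(permitted=permitted, inheritable=p.inheritable,
                      effective=effective, ambient=ambient,
                      bounding=p.bounding)


NET_RAW = "CAP_NET_RAW"
shell = ThreadCaps(bounding=frozenset({NET_RAW}))

# 'Capabilities aware' ping: permitted only, the program raises it itself.
aware_ping = FileCaps(permitted=frozenset({NET_RAW}))
# Non-aware ping: the effective bit makes the kernel raise it on exec.
legacy_ping = FileCaps(permitted=frozenset({NET_RAW}), effective_bit=True)

after_aware = caps_after_execve(shell, aware_ping)
after_legacy = caps_after_execve(shell, legacy_ping)
print(sorted(after_aware.permitted), sorted(after_aware.effective))
# prints: ['CAP_NET_RAW'] []
print(sorted(after_legacy.permitted), sorted(after_legacy.effective))
# prints: ['CAP_NET_RAW'] ['CAP_NET_RAW']
```

In the first case the capability sits in the permitted set until the program explicitly raises it, which is what lets a ‘capabilities aware’ binary hold it in the effective set only for as long as it is needed.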
We can also add capabilities to the inheritable set on a file. This allows us to say ‘grant these capabilities only if they are in the executable inheritable set and also in the inheritable set for the process’, which means we can control the environments in which the executable can be used.
This makes sense, but there is a problem: when using ordinary executables without inheritable capabilities set, then F(inheritable) = 0, meaning P(inheritable) is ignored. As this is the case for the vast majority of the executables, the usability of inheritable capabilities was limited.
We can’t create a semi-privileged process tree with a subset of capabilities that are automatically inherited unless we also update the executables. In other words, even if your thread had extra capabilities, you couldn't run a helper program and let it use those capabilities unless that program's file also had the matching inheritable capabilities set.
This situation was remedied by the addition of the ambient set, which again is inherited from the parent, but is also automatically added into the new permitted set. So now, if you're in an environment which has CAP_NET_RAW in the ambient set, the ping executable should work even if it is a ‘normal’ file (without capabilities or setuid/setgid bits set).
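The difference between the inheritable and ambient sets shows up clearly when the executable is a ‘normal’ file. The sketch below specialises the execve rules to that case (no file capabilities, no setuid/setgid bits); the function name is hypothetical and root handling is again ignored.

```python
def exec_normal_file(inheritable, ambient):
    """Return (permitted, effective) after exec'ing a file that has no
    capabilities and no setuid/setgid bits, per the execve rules."""
    file_inheritable = frozenset()  # F(inheritable) = 0
    file_permitted = frozenset()    # F(permitted) = 0, so cap_bset has nothing to mask
    # The file is not privileged, so the ambient set survives the exec.
    permitted = (inheritable & file_inheritable) | file_permitted | ambient
    # F(effective) is unset, so the effective set is just the ambient set.
    effective = ambient
    return permitted, effective


NET_RAW = frozenset({"CAP_NET_RAW"})

# Inheritable only: the capability is lost on exec of a normal file.
print(exec_normal_file(inheritable=NET_RAW, ambient=frozenset()))

# Inheritable and ambient: the capability reaches the new program.
print(exec_normal_file(inheritable=NET_RAW, ambient=NET_RAW))
```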
There are also some important rules for adding/removing capabilities from the ambient set. A capability can never be in a thread’s ambient set if it is not also in the inheritable and permitted sets. Dropping a capability from either of those sets also removes it from the ambient set. Non-root threads can add capabilities from their permitted set to the ambient set, which will allow their children to also use that capability with normal files.
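These ambient-set rules can be captured in a small toy model. To be clear, this is an illustration of the invariants as described above, not how the kernel implements them, and the class and method names are invented; in a real program the raise would go through prctl(2) with PR_CAP_AMBIENT.

```python
class ThreadModel:
    """Toy model of a thread's permitted, inheritable and ambient sets."""

    def __init__(self, permitted=(), inheritable=()):
        self.permitted = set(permitted)
        self.inheritable = set(inheritable)
        self.ambient = set()

    def raise_ambient(self, cap):
        # A capability may only become ambient if it is already in
        # both the permitted and the inheritable sets.
        if cap not in self.permitted or cap not in self.inheritable:
            raise PermissionError(f"{cap} must be permitted and inheritable")
        self.ambient.add(cap)

    def drop_permitted(self, cap):
        # Dropping from the permitted set also removes it from ambient.
        self.permitted.discard(cap)
        self.ambient.discard(cap)

    def drop_inheritable(self, cap):
        # Dropping from the inheritable set also removes it from ambient.
        self.inheritable.discard(cap)
        self.ambient.discard(cap)


t = ThreadModel(permitted={"CAP_NET_RAW"})
try:
    t.raise_ambient("CAP_NET_RAW")  # refused: not yet inheritable
except PermissionError as err:
    print("refused:", err)

t.inheritable.add("CAP_NET_RAW")
t.raise_ambient("CAP_NET_RAW")      # now allowed
t.drop_inheritable("CAP_NET_RAW")   # ambient loses it again
print(t.ambient)
```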
The bounding and permitted sets also sound straightforward, but still hide some complexities. The bounding set is roughly intended to control which capabilities are available within a process tree. A capability can be added to the inheritable set if the current thread has the CAP_SETPCAP capability and the capability is within the bounding set.
The confusion comes when we consider that the bounding set does not control the inheritable set; you can keep capabilities in the inheritable set that are not in the bounding set. If a thread has a capability in its inheritable set and executes a file with the capability in its inheritable set, the resultant process will have that capability in the permitted set (and in the effective set too, if the file's effective bit is set), regardless of whether it exists in the bounding set.
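Here is a rough sketch of that asymmetry, following the rules as I read them in capabilities(7): the bounding set limits what a thread can newly add to its inheritable set and masks the file permitted set, but it does not mask what is already inheritable. Both function names are hypothetical.

```python
def may_set_inheritable(new_inh, old_inh, permitted, bounding, has_setpcap):
    """Would a capset(2)-style change of the inheritable set be allowed?"""
    # Without CAP_SETPCAP, only capabilities already inheritable or
    # permitted may be placed in the inheritable set.
    if not has_setpcap and not new_inh <= (old_inh | permitted):
        return False
    # In all cases, anything newly added must be within the bounding set.
    return new_inh <= (old_inh | bounding)


def permitted_after_exec(p_inh, f_inh, f_perm, bounding):
    """P'(permitted) for a file that carries capabilities.
    The ambient term is omitted: such a file is privileged, which
    clears the ambient set on exec."""
    return (p_inh & f_inh) | (f_perm & bounding)


RAW = frozenset({"CAP_NET_RAW"})
EMPTY = frozenset()

# With CAP_NET_RAW removed from the bounding set, it cannot be newly
# added to the inheritable set, even with CAP_SETPCAP.
print(may_set_inheritable(RAW, EMPTY, RAW, EMPTY, has_setpcap=True))

# If it was already inheritable, it can stay there.
print(may_set_inheritable(RAW, RAW, EMPTY, EMPTY, has_setpcap=False))

# The file permitted set is masked by the (empty) bounding set.
print(permitted_after_exec(EMPTY, EMPTY, RAW, bounding=EMPTY))

# The inheritable path is not masked: the capability still arrives.
print(permitted_after_exec(RAW, RAW, EMPTY, bounding=EMPTY))
```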
At this point, I should also explain the "securebits" flags that control how uid 0 (root) threads are handled. But it looks like it’s a bit of a mess that seems to have also evolved over time. The name of one of the flags is SECBIT_NO_SETUID_FIXUP, which doesn’t really fill me with confidence that it’s going to be a simple and elegant solution, so I’m leaving that as homework.
Some examples are in order, but I've put those in a follow-up post, where I also cover what tools are available for working with capabilities and actually get to the bit where containers are involved. And I'll complain bitterly about the complexity and lack of support.