Security Features of Apptainer vs. Rootless Podman: Part 2

Dave Godlove, October 17, 2023

This is the second blog post in a 3-part series that compares and contrasts the security features of Apptainer and Rootless Podman. Part 2 focuses on implicit and explicit use of the User Namespace to set up containers and provide root access.

In part 1 of this blog post miniseries, we talked about the history of Apptainer and Podman to understand how their development goals affected their security stance. We also provided some background on Linux Namespaces. Now let’s think about the User Namespace and the various ways that these container platforms use (or don’t use) it.

Implicit use of the User Namespace for container creation

Apptainer: provided unprivileged containers before it was popular

The User Namespace is great… if your kernel supports it and if it is enabled on your system. Unfortunately, the User Namespace was not in widespread use within the HPC community when Apptainer made its debut. So Apptainer needed a different approach to grant unprivileged users access to containers.

Before the User Namespace was widely adopted, Apptainer’s strategy was based on three points:

At runtime, append entries to /etc/passwd and /etc/group for the calling user so that they are the same user (with the same UID, GID, permissions, etc.) inside and outside the container.
Mount the new container file system with the nosuid option and kick off the containerized process with the PR_SET_NO_NEW_PRIVS kernel flag so that the user can't gain any extra privileges once they are inside the container.
Include a root-owned suid program with the Apptainer installation to selectively elevate privileges during container setup. This allows an unprivileged user to do things like mount the container file system and set up the mount namespace. Once the privileged operations are completed on the user’s behalf, permanently drop privileges and spawn the containerized process(es).
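To make point 1 concrete, here is a minimal shell sketch of the kind of passwd(5) entry involved. The helper name make_passwd_entry is hypothetical and this is not Apptainer’s actual code; it only reproduces the shape of the line that appears in the demonstration (name, placeholder password, UID, GID, empty GECOS field, home directory, empty shell field).

```shell
# Hypothetical helper: print a passwd(5)-style entry in the same shape as
# the line Apptainer appends for the calling user inside the container.
# Fields: name:password:UID:GID:GECOS:home:shell (GECOS and shell left empty).
make_passwd_entry() {
    local name="$1" uid="$2" gid="$3" home="$4"
    printf '%s:x:%s:%s::%s:\n' "$name" "$uid" "$gid" "$home"
}
```

For example, `make_passwd_entry demouser 1001 1001 /home/demouser` prints `demouser:x:1001:1001::/home/demouser:`, matching the tail of /etc/passwd in the demonstration. Calling it with `"$(id -un)" "$(id -u)" "$(id -g)" "$HOME"` produces the equivalent line for whoever runs it.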

Observe this demonstration, which illustrates points 1 and 2 above.

[demouser@demobox ~]$ apptainer shell docker://godlovedc/sudo
INFO:    Using cached SIF image

Apptainer> tail -n 1 /etc/passwd        
demouser:x:1001:1001::/home/demouser:

Apptainer> sudo whoami
sudo: /etc/sudo.conf is owned by uid 65534, should be 0
sudo: The "no new privileges" flag is set, which prevents sudo from running as root.
sudo: If sudo is running in a container, you may need to adjust the container configuration to disable the flag.

This strategy provides a safe way for unprivileged users to run containers without gaining any additional privileges. The main drawback is that the suid workflow puts the responsibility for security on the Apptainer developers.

Now back to the User Namespace. Since early 2019, it has been possible to install Apptainer without the suid starter and to rely on the User Namespace for the setup steps that would otherwise require elevated privileges, but the list of supported features was initially pretty small. The great advantage of installing Apptainer this way is that it shifts the onus for secure programming from the Apptainer developers to the Linux kernel developers. This gets more eyes on the security-critical code. 👀

Over time, Apptainer has begun to rely more on the User Namespace (and other tools like FUSE) to allow users to carry out the operations needed to set up containers without privilege. In September 2022, the Apptainer developers changed the default installation procedure to the non-suid workflow and began requiring admins to explicitly specify if they want to install the legacy suid starter. This decision was made for three reasons: the Apptainer feature set had expanded to cover most use cases with the non-suid workflow, the User Namespace was judged to have become a stable feature with fewer vulnerabilities being reported, and most HPC sites either had User Namespaces enabled or could enable them within their environment.
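Whether a host can use the non-suid workflow at all depends on User Namespaces being enabled. One quick check is the user.max_user_namespaces sysctl, exposed as /proc/sys/user/max_user_namespaces, where 0 means no new User Namespaces can be created. The sketch below, built around the hypothetical function name userns_status, reads that file. It looks at this one knob only; some distributions add their own controls (for example, Debian’s kernel.unprivileged_userns_clone), so treat a non-zero value as necessary rather than sufficient.

```shell
# Hypothetical helper: report whether User Namespaces look enabled, based on
# the user.max_user_namespaces sysctl (0 = no new User Namespaces allowed).
# The file path can be overridden via the first argument for testing.
userns_status() {
    local file="${1:-/proc/sys/user/max_user_namespaces}"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 2
    fi
    local max
    max="$(cat "$file")"
    if [ "$max" -gt 0 ] 2>/dev/null; then
        echo "enabled (max_user_namespaces=$max)"
    else
        echo "disabled (max_user_namespaces=$max)"
    fi
}
```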

Going back to Apptainer’s original strategy for providing containers to unprivileged users, it is important to note that points 1 and 2 above are still implemented in non-suid Apptainer. So even when Apptainer is leveraging the User Namespace to get your container started, it will still append entries for your UID and GID to /etc/passwd and /etc/group, and it will still prevent you from escalating privilege inside the container. This explains why you cannot leverage a suid-bit program like sudo within Apptainer, even if you are running the non-suid version under the User Namespace.
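You can observe both of these protections from inside a running container. The sketch below (hypothetical helper names no_new_privs and root_is_nosuid) reads the NoNewPrivs field of /proc/self/status, which is 1 when PR_SET_NO_NEW_PRIVS is in effect, and the mount options of the root file system from /proc/self/mounts. It assumes the container image provides awk, and depending on how a runtime assembles the root file system, the mount that matters may not be the one listed at /.

```shell
# Hypothetical helpers for inspecting the "legacy" protections from inside
# a container. File paths are parameters so the parsing can be tested offline.

# Print the NoNewPrivs value (1 = PR_SET_NO_NEW_PRIVS is set), or "unknown".
no_new_privs() {
    local status="${1:-/proc/self/status}"
    awk '/^NoNewPrivs:/ { print $2; found = 1 } END { if (!found) print "unknown" }' "$status"
}

# Print "yes" if the last mount entry for / carries the nosuid option.
# /proc/self/mounts fields: device mountpoint fstype options dump pass
root_is_nosuid() {
    local mounts="${1:-/proc/self/mounts}"
    awk '$2 == "/" { opts = $4 }
         END { print (((("," opts ",") ~ /,nosuid,/)) ? "yes" : "no") }' "$mounts"
}
```

Given the sudo message in the demonstration above, you would expect no_new_privs to print 1 in an Apptainer shell; running the same helpers in a rootless Podman container is one way to see the difference described in the next section.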

Since the shift to User Namespaces, the features supported by Apptainer in non-suid mode have continued to expand. Most recently, it has become possible to use gocryptfs to encrypt containers at build time without privilege (more on that later). One notable exception to this general trend is network configuration. Doing any useful work within the Network Namespace (like exposing ports) requires real privilege on the host system in both suid and non-suid mode. But in practice, this is not a problem for the vast majority of HPC users, since by default Apptainer just uses the host network without any Namespace abstraction.

Podman: early adopter of the User Namespace workflow

Since its inception, the one and only mechanism that Podman has used to provide containers to unprivileged users has been the User Namespace. So you can either run Podman as a privileged user in "rootful mode" (not great for security) or leverage the User Namespace in "rootless mode."

Because Podman has never had a mechanism for providing unprivileged users access to containers outside of leveraging the User Namespace, it does not have some of the legacy bits and pieces that Apptainer has. Podman does not ship with any suid workflow bits. And it does not mount the container file system with the nosuid option or set the PR_SET_NO_NEW_PRIVS kernel flag on containerized processes. The practical result of these differences is that suid programs can be used to elevate privileges within Podman containers. This is not a security concern because the containers run within a new User Namespace, and the kernel knows the difference between root on the host system and root within a Namespace. So you don’t have to worry about a user elevating privileges within a container and rewriting host system configuration in /etc, for instance. On the negative side, you may view this as fewer layers of (redundant) security.
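The reason root inside a rootless Podman container is harmless on the host is the UID mapping the kernel applies. The sketch below, using the hypothetical helper name container_to_host_uid, translates a UID seen inside a User Namespace into the host UID using the three-column format of /proc/self/uid_map (first ID inside, first ID outside, range length). The example mapping used here (0 to 1001 for one ID, then 1 to 165536 for 65536 IDs) is taken from the newuidmap arguments in the error message later in this post; your own mapping will differ.

```shell
# Hypothetical helper: translate a UID inside a User Namespace into the host
# UID, using the uid_map format: <first ID inside> <first ID outside> <count>.
# Reads /proc/self/uid_map by default; pass "-" to read a map from stdin.
container_to_host_uid() {
    local uid="$1" map="${2:-/proc/self/uid_map}"
    awk -v id="$uid" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) print "unmapped" }
    ' "$map"
}
```

For example, `printf '0 1001 1\n1 165536 65536\n' | container_to_host_uid 0 -` prints 1001: root in the container is just UID 1001 on the host, which is why it cannot rewrite root-owned files in the host’s /etc.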

Explicit use of the User Namespace to become a different user within the container

Up until now, we have discussed the ways in which container runtimes leverage the User Namespace implicitly on your behalf to set the container up. But a user can also leverage the User Namespace explicitly to change their UIDs and GIDs inside the container. This is often done to become root (or to pretend to be root) inside of the container.

Apptainer: root on demand

By default, Apptainer maps your UID and GID on the host to the same UID and GID in the container. As we discussed above, this has nothing to do with Namespaces and is accomplished simply by making entries in the appropriate files in /etc and bind mounting your home directory into the container. But it’s possible to change your UID and GID to something else using the User Namespace.

Both the legacy suid version of Apptainer and the newer non-suid version allow users to explicitly leverage the User Namespace (if it is present on the host and properly configured) to change their UIDs inside the container. Most of the time, the user wants to become root inside the container. So there is a convenience option called --fakeroot that allows you to map your UID on the host to UID==0 inside the container:

[demouser@demobox ~]$ apptainer shell --fakeroot docker://alpine
INFO:    Using cached SIF image

Apptainer> id
uid=0(root) gid=0(root) groups=65534(nobody),65534(nobody),65534(nobody),0(root)
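The supplementary groups showing up as nobody (65534) in that output are most likely IDs that have no mapping in the new User Namespace; the kernel displays unmapped IDs as the overflow ID. Additional IDs can be mapped from subordinate ranges listed in /etc/subuid and /etc/subgid, which Apptainer’s fakeroot feature can use when they are configured for your user (exact behavior without them depends on the Apptainer version and install mode). The following sketch, with the hypothetical helper name subid_range, looks up a user’s range in one of these files; it matches by login name only, although subuid(5) also allows a numeric UID in the first field.

```shell
# Hypothetical helper: print "first count" for a user's subordinate ID range
# from a subuid/subgid-style file, where each line is name:first_id:count.
# Returns a non-zero status if the user has no entry.
subid_range() {
    local user="$1" file="${2:-/etc/subuid}"
    awk -F: -v u="$user" '
        $1 == u { print $2, $3; found = 1; exit }
        END { exit !found }
    ' "$file"
}
```

With an entry such as `demouser:165536:65536`, `subid_range demouser` prints `165536 65536`, meaning 65536 container IDs can be mapped onto host IDs starting at 165536.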

Starting with Apptainer v1.1.0, the --fakeroot option is silently implied if you attempt to build a container without privilege.

If you want to change your UID and GID to something other than root, you can also do so explicitly through the User Namespace. The syntax is a bit different in this case:

[demouser@demobox ~]$ apptainer shell --security uid:3210,gid:3210 docker://alpine
INFO:    Using cached SIF image

Apptainer> id
uid=3210 gid=3210 groups=3210

Podman: your normal UID on demand (through the User Namespace)

Podman takes pretty much the exact opposite approach from that of Apptainer. By default, Podman enters a new User Namespace when you run a container and automatically maps your UID and GID to 0 (along with adding you to a bunch of other groups, at least in this example).

[demouser@demobox ~]$ podman run -ti alpine /bin/sh

/ # whoami
root

/ # id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)

If you want to be the same user inside and outside of the container, there is a convenience flag that you can pass for that. But this is still accomplished by UID and GID mapping within a new User Namespace.

[demouser@demobox ~]$ id
uid=1234(demouser) gid=1234(demouser) groups=1234(demouser),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

[demouser@demobox ~]$ podman run --userns=keep-id -ti alpine /bin/sh

~ $ id
uid=1234(demouser) gid=1234(demouser) groups=1234(demouser)

So, in the end, you can still be either root or yourself inside of a container using Apptainer or Podman, but the defaults are opposite, and the way you keep your UID in the container involves the User Namespace with Podman and config files with Apptainer. 😸

Unprivileged Installation of the Runtime Itself

Since we can run containers completely without privilege, it stands to reason that we should be able to install the container runtime itself completely without privilege.

Apptainer: community support for installation without root

Since Apptainer release 1.1.4, there has been a convenience script that allows users (on several different distributions) to easily install a relocatable version of Apptainer in the directory of their choice. This means that unprivileged users no longer need to rely on an admin to install and configure Apptainer on their behalf.

Although non-suid Apptainer is a very secure way to run containers, this development also has some (potentially unexpected) security considerations. Before the non-suid version, Apptainer had to be installed with elevated privileges. At that time, the main configuration file (apptainer.conf) was required to be owned by the root user with secure permissions, or Apptainer would refuse to run. As an administrator, it is important to realize that an unprivileged user can now circumvent the main Apptainer configuration files by installing and running their own version. This may be a security concern if, for instance, you are using the Execution Control List (ECL) feature. ECL allows you to specify which containers are allowed or blocked from running on your system based on their cryptographic signatures (more on these later). If you rely on this feature, you need to disable User Namespaces on your host to prevent unprivileged users from installing their own versions of Apptainer and running untrusted containers.
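For administrators who rely on ECL, one way to disable User Namespaces is to set the user.max_user_namespaces sysctl to 0 and make it persistent with a drop-in under /etc/sysctl.d. The helper write_userns_disable_dropin and the file name 90-disable-userns.conf below are hypothetical choices for illustration. Keep in mind that this also breaks every other consumer of unprivileged User Namespaces on the host, including rootless Podman and any non-suid Apptainer installation, so in practice the admin-installed suid version of Apptainer becomes the way to run containers.

```shell
# Hypothetical helper: write a sysctl drop-in that disables User Namespaces
# by setting user.max_user_namespaces to 0. The target directory is a
# parameter so the result can be inspected before touching /etc/sysctl.d.
write_userns_disable_dropin() {
    local dir="${1:-/etc/sysctl.d}"
    local file="$dir/90-disable-userns.conf"
    printf '%s\n' \
        '# Disable User Namespaces (blocks self-installed, non-suid Apptainer)' \
        'user.max_user_namespaces = 0' > "$file" || return 1
    echo "$file"
}
```

On a real host you would run it as root with no argument and then apply the setting with `sysctl --system`.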

Podman: theoretically possible but technically challenging

It seems like it should be possible to install Podman as an unprivileged user, since it can be run without any privileges through the User Namespace. But there is currently no simple(-ish) way to do so. The Podman installation documentation shows that you can build Podman from source, which would be a good starting point for an unprivileged installation. Unfortunately, however, a lot of dependencies must be installed first, and the instructions install them via a package manager. So you would need to track down all of these dependencies and either build them from source and install them in a custom location, or perhaps grab the correct RPMs for your system and extract them to an unprivileged location. (This is essentially what the Apptainer convenience script for unprivileged installations does for you.) In my hands, the Podman build also doesn’t seem to respect the --prefix option during configuration. This adds another wrinkle, since you would need to manually move the compiled executables to the location of your choice instead of simply running make install.

If you are like me, you might be wondering if you can install Apptainer without privilege and then build a container with Podman in it. I spent a little bit of time trying to get this to work and met with a lot of obstacles.

[demouser@demobox ~]$ apptainer shell --bind /etc/subuid,/etc/subgid,var:/var,run:/run podman.sif 
Apptainer> podman run --network=host --cgroups=disabled -ti alpine /bin/sh
WARN[0000] The cgroupv2 manager is set to systemd but there is no systemd user session available 
WARN[0000] For using systemd, you may need to login using an user session 
WARN[0000] Alternatively, you can enable lingering with: `loginctl enable-linger 1001` (possibly as root) 
WARN[0000] Falling back to --cgroup-manager=cgroupfs    
WARN[0000] "/" is not a shared mount, this could cause issues or missing mounts with rootless containers 
ERRO[0000] running `/usr/bin/newuidmap 1241297 0 1001 1 1 165536 65536`: newuidmap: write to uid_map failed: Operation not permitted 
ERRO[0000] invalid internal status, try resetting the pause process with "podman system migrate": cannot set up namespace using "/usr/bin/newuidmap": should have setuid or have filecaps setuid: exit status 1

This fails because Apptainer blocks the ability to run suid programs like newuidmap (which is a helper program allowing us to properly configure the User Namespace). The obvious solution is to use --fakeroot to enter the container.
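The error message above says that newuidmap should have either the setuid bit or file capabilities. The sketch below (hypothetical helper name privilege_mechanism) reports which of the two a binary carries, using the shell’s -u file test and, if it is installed, getcap from libcap. Inside a regular Apptainer container, the nosuid mount and the NoNewPrivs flag discussed earlier keep either mechanism from taking effect, which is why writing uid_map is not permitted.

```shell
# Hypothetical helper: report how a binary such as newuidmap gains privilege:
# via the setuid bit, via file capabilities (checked with getcap, if
# available), or neither.
privilege_mechanism() {
    local bin="$1"
    if [ -u "$bin" ]; then
        echo "setuid"
    elif command -v getcap >/dev/null 2>&1 && [ -n "$(getcap "$bin" 2>/dev/null)" ]; then
        echo "filecaps"
    else
        echo "none"
    fi
}
```

On a typical host, `privilege_mechanism /usr/bin/newuidmap` should print either setuid or filecaps, depending on how the distribution packages it.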

After some trial and error, I arrived at this series of commands:

[demouser@demobox ~]$ mkdir var run

[demouser@demobox ~]$ apptainer shell --fakeroot --bind var:/var,run:/run podman.sif 

Apptainer> podman --runtime crun run -v /sys:/sys --network=host --cgroups=disabled -ti alpine /bin/sh

/ # cat /etc/os-release 
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.18.3
PRETTY_NAME="Alpine Linux v3.18"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

As you can see, I needed to create new var and run directories that I could bind mount into the container to make /var and /run writable. The --fakeroot option makes Podman run in “rootful” mode (i.e., it won’t try to use the User Namespace). Once in the container, I have to specify that I want to use crun (instead of runc) as the container runtime. I bind mount /sys because Podman needs it; since /sys is bind mounted into Apptainer containers by default, this is the host version of /sys. I also need to disable the Network and Cgroup Namespaces, since they will not work properly without real privilege and access to systemd, respectively. After all of that, the container runs! 😅

As you can see, this works, but a lot of features are disabled or missing when running Podman nested in Apptainer like this. Still, if you really need an easy-ish way to install and run Podman without privileges, this is (maybe) better than nothing.

Whew! Maybe we should take a break to rest and digest. In the next episode of our blog post miniseries, we’ll discuss several other topics like cryptographically signing and verifying containers, creating and running encrypted containers, and a few other miscellaneous, security-adjacent topics. See you, space cowboy… 🚀 🪐 🤠