A Tale of Container UIDs
The Question
If a process running inside a container as user ID 1000 creates a file on a shared mount, what user ID will be shown as the file’s owner when viewed from outside the container?
Example
Assume we have the following Containerfile:
FROM debian:bookworm
RUN useradd --uid 1000 nonroot
USER nonroot
And then I run the following commands:
podman build . -t debian:bookworm-nonroot
podman run --volume /tmp:/tmp debian:bookworm-nonroot touch /tmp/x
What will be shown as the owner user ID when running the following command?
ls -l /tmp/x
Background
Before discussing the solution to this particular problem, I want to ensure that you have all the prerequisite knowledge used in the answer.
If you feel like you already understand the subjects discussed in the following sections, you can skip to the Revisiting The Question section.
Containers
Linux Containers have emerged as a key open source application packaging and delivery technology, combining lightweight application isolation with the flexibility of image-based deployment methods.
Several components are needed for Linux Containers to function correctly, most of them are provided by the Linux kernel. Kernel namespaces ensure process isolation and cgroups are employed to control the system resources.
- Overview of Containers in Red Hat Systems: Chapter 1. Introduction to Linux Containers[^1]
This is a high level overview of containers. For our purposes, we need to treat containers a little bit differently.
In our eyes, containers are simply processes. These processes are isolated from each other, and from the host, using Linux namespaces.
Linux Namespaces
A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.
- namespaces(7) - Linux Man Pages[^2]
Linux namespaces are isolation mechanisms used to isolate processes from certain resources accessible through the Linux kernel.
There are several namespace types, but of particular importance to containers are:
- PID namespaces
- Mount namespaces
- Network namespaces
- User namespaces
This post, as foreshadowed by The Question, is going to discuss the user namespace.
User Namespace
User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs, the root directory, keys, and capabilities.
- user_namespaces(7) - Linux Man Pages[^3]
Processes within a user namespace are unaware of the user IDs/group IDs used on the host. What does a user ID/group ID within a user namespace look like to an outside observer?
An important addition is made in the next paragraph:
A process’s user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace.
- user_namespaces(7) - Linux Man Pages[^3]
If user IDs/group IDs inside and outside a user namespace don’t have to match, what is the relationship between them?
uid_map and gid_map
User ID/group ID mapping is the act of mapping user IDs/group IDs within a user namespace to user IDs/group IDs outside of it.
By default, when no new user namespace is created for a process (that is, it stays in the initial user namespace), the identity mapping is used, meaning a user ID/group ID seen by the process is equivalent to the same user ID/group ID on the host.
Namespace user ID | Host user ID |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
. . . | . . . |
In this case, if a process running under a user with user ID 2, using the above uid_map, creates a file, a user outside the namespace will see the file as owned by user ID 2.
However, you don’t have to use the default mapping. User ID/group ID mappings can be modified through the files /proc/<pid>/uid_map and /proc/<pid>/gid_map.
In these files, each line is of the following format:
<starting-namespace-uid/gid> <starting-host-uid/gid> <count>
Each line gives a range of consecutive user IDs/group IDs to map from within the namespace to the host. For example, the following uid_map:
0 1000 3
Results in the following mapping between user IDs in the namespace to the host:
Namespace UID | Host UID |
---|---|
0 | 1000 |
1 | 1001 |
2 | 1002 |
In this case, if a process running under a user with user ID 2, using the above uid_map, creates a file, a user outside the namespace will see the file as owned by user ID 1002.
Any user ID/group ID not found in the uid_map/gid_map files is unmapped; it does not fall back to the identity mapping. For example, in the same namespace, user ID 3 has no corresponding host user ID, so a process cannot switch to it, and files owned by host user IDs that are not mapped show up inside the namespace as owned by the overflow user ID (65534 by default).
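The range arithmetic above can be sketched in a few lines of Python. This is only an illustration of how a uid_map is read, not kernel code; the function names parse_id_map and to_host_id are made up for this post. IDs that fall outside every range are reported as None, since the kernel treats them as unmapped.

```python
from typing import Optional


def parse_id_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map/gid_map content into (ns_start, host_start, count) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            ns_start, host_start, count = (int(field) for field in line.split())
            ranges.append((ns_start, host_start, count))
    return ranges


def to_host_id(id_map: str, ns_id: int) -> Optional[int]:
    """Translate an ID inside the namespace to the ID seen from the host."""
    for ns_start, host_start, count in parse_id_map(id_map):
        if ns_start <= ns_id < ns_start + count:
            return host_start + (ns_id - ns_start)
    return None  # no range covers this ID, so it is unmapped


print(to_host_id("0 1000 3", 2))  # 1002
print(to_host_id("0 1000 3", 3))  # None
```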
Permissions
Not every process can modify another process’s uid_map/gid_map, or even its own; several restrictions apply, which are outlined below:
One of the following two cases applies:
- Either the writing process has the CAP_SETUID/CAP_SETGID capability in the parent user namespace.
  - No further restrictions apply: the process can make mappings to arbitrary user IDs (group IDs) in the parent user namespace.
- Or otherwise all of the following restrictions apply:
  - The data written to uid_map/gid_map must consist of a single line that maps the writing process’s effective user ID/group ID in the parent user namespace to a user ID/group ID in the user namespace.
  - The writing process must have the same effective user ID as the process that created the user namespace.
- user_namespaces(7) - Linux Man Pages[^3]
Let’s break it down:
Either the writing process has the CAP_SETUID/CAP_SETGID capability in the parent user namespace.
CAP_SETUID/CAP_SETGID are capabilities[^4] that give a process the ability to modify another process’s uid_map/gid_map. So any process that has the CAP_SETUID/CAP_SETGID capability can modify any other process’s uid_map/gid_map, so long as that other process is within its user namespace.
If the process doesn’t have those capabilities, all of the following restrictions must apply:
The data written to uid_map/gid_map must consist of a single line that maps the writing process’s effective user ID/group ID in the parent user namespace to a user ID/group ID in the user namespace.
<child-namespace-uid> <parent-namespace-uid> 1
The above is the only valid value that can be written to a process’s uid_map. It means that we can only map our current user ID to some user ID inside the namespace, and that’s it.
The writing process must have the same effective user ID as the process that created the user namespace.
If I don’t have the required capabilities, I can only modify a process’s uid_map/gid_map with the above values if I was the one that created its namespace.
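To make the unprivileged case concrete, here is a small Python sketch that checks a proposed uid_map against the two restrictions broken down above. It models only those two rules as this post describes them (the kernel applies further checks), and unprivileged_write_allowed is a hypothetical helper, not a real API.

```python
def unprivileged_write_allowed(map_text: str, writer_euid: int, creator_euid: int) -> bool:
    """Check a proposed uid_map for a writer that lacks CAP_SETUID."""
    lines = [line for line in map_text.splitlines() if line.strip()]
    if len(lines) != 1:
        return False  # the data must consist of a single line
    _ns_id, parent_id, count = (int(field) for field in lines[0].split())
    if count != 1 or parent_id != writer_euid:
        return False  # only the writer's own effective user ID may be mapped
    # the writer must be the same user that created the namespace
    return writer_euid == creator_euid


print(unprivileged_write_allowed("0 501 1", 501, 501))           # True
print(unprivileged_write_allowed("0 100000 65536", 501, 501))    # False
```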
subuids and subgids
Due to these heavy restrictions, and the need to avoid vulnerabilities by ensuring that every user ID/group ID in the namespace is mapped to a non-root user ID/group ID, container engines like podman rely on a mechanism called subuids and subgids.
subuids and subgids allow for the system administrator to delegate user ID/group ID ranges to a non-root user. This enables the non-root user to map more than 1 user ID/group ID from the namespace, even though they don’t have the CAP_SETUID/CAP_SETGID capabilities.
These are configured using the files /etc/subuid and /etc/subgid, in which each line has the following format:
<username>:<starting-uid/gid>:<count>
This allows <username> to create a container process with the following uid_map/gid_map.
0 <starting-uid/gid> <count>
For example, the following /etc/subuid:
tomerh:100000:65536
Allows tomerh to create a container process with the following uid_map:
0 100000 65536
However, if you’ll recall, in the Permissions section, I said that unless we have the CAP_SETUID/CAP_SETGID capabilities, we can only map ourselves into the namespace. We certainly don’t have CAP_SETUID/CAP_SETGID, so what gives? How can we suddenly map 65536 users into the namespace?
Well, we can’t do that, but container engines can, due to a little trick called file capabilities.
Container engines utilize two binaries, newuidmap and newgidmap, that have the CAP_SETUID/CAP_SETGID capabilities. These binaries read /etc/subuid and /etc/subgid, verify that you have enough user IDs/group IDs to map all of the user IDs/group IDs inside the container, and modify the container process’s uid_map/gid_map.
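The lookup and the "do you have enough IDs" check can be sketched roughly as follows. This is a simplified illustration in Python, not how newuidmap is actually implemented; subid_range and id_map_from_subids are hypothetical names, and only the first range listed for a user is considered.

```python
from typing import Optional


def subid_range(subid_text: str, username: str) -> Optional[tuple[int, int]]:
    """Return (start, count) of the first range delegated to username, if any."""
    for line in subid_text.splitlines():
        if not line.strip():
            continue
        name, start, count = line.strip().split(":")
        if name == username:
            return int(start), int(count)
    return None


def id_map_from_subids(subid_text: str, username: str, needed: int) -> str:
    """Build a uid_map/gid_map line covering `needed` IDs, or fail if too few are delegated."""
    found = subid_range(subid_text, username)
    if found is None:
        raise LookupError(f"no subordinate IDs delegated to {username}")
    start, count = found
    if needed > count:
        raise ValueError(f"{username} has {count} subordinate IDs, {needed} requested")
    return f"0 {start} {needed}"


print(id_map_from_subids("tomerh:100000:65536", "tomerh", 65536))  # 0 100000 65536
```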
Revisiting The Question
Now that we’ve covered the required background, we can revisit The Question. As we’ve discussed previously, in all possible cases there is some kind of user ID mapping, whether it is the identity mapping that is the default when running rootful, or some other mapping configured by the system administrator/container engine.
In order to further illustrate this, we’ll try some concrete examples. In all examples we’ll use the same image built in the original Example. Try and see if you can correctly guess the output of the commands.
Rootful Default Mapping
[root@fedora ~]$ podman run --volume /tmp:/tmp debian:bookworm-nonroot touch /tmp/x
Output
[root@fedora ~]$ ls -l /tmp/x
-rw-r--r-- 1 1000 1000 0 Jan 31 23:45 /tmp/x
Since by default rootful containers use the identity mapping, it is expected that the user ID will stay the same.
Rootful Custom Mapping
[root@fedora ~]$ podman run --uidmap=0:100000:65536 -v /tmp:/tmp debian:bookworm-nonroot touch /tmp/x
Output
[root@fedora ~]$ ls -l /tmp/x
-rw-r--r-- 1 101000 101000 0 Jan 31 23:45 /tmp/x
The option --uidmap receives the same parameters as uid_map, except delimited by colons and not whitespace. It is easy to see why the user ID is 101000 if we look at a table representation of the user ID mapping:
Container UID | Host UID |
---|---|
0 | 100000 |
1 | 100001 |
2 | 100002 |
. . . | . . . |
1000 | 101000 |
. . . | . . . |
65535 | 165535 |
However, why was the group ID 101000? We didn’t use any group ID mapping, so it should have been 1000, shouldn’t it?
In these kinds of cases, where the user ID/group ID on the host don’t match our expectations, we should look at the /proc/<pid>/uid_map and /proc/<pid>/gid_map files:
[root@fedora ~]$ podman run --uidmap=0:100000:65536 -d --name rootful debian:bookworm-nonroot sleep infinity
[root@fedora ~]$ cat /proc/$(podman inspect rootful | jq '.[0].State.Pid')/gid_map
0 100000 65536
When --gidmap isn’t specified, podman uses --uidmap’s value for it. The opposite is true as well: when --uidmap isn’t specified, podman uses --gidmap’s value for it.
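The fallback between --uidmap and --gidmap in this example can be modelled with a short Python sketch. It handles only the simple container:host:count form used in this post, and the helper names (parse_map_option, effective_maps, to_host_id) are invented for illustration; this is not podman's code.

```python
from typing import Optional

Mapping = tuple[int, int, int]  # (container_start, host_start, count)


def parse_map_option(value: str) -> Mapping:
    """Parse the colon-delimited form given to --uidmap/--gidmap."""
    container_start, host_start, count = (int(part) for part in value.split(":"))
    return container_start, host_start, count


def effective_maps(uidmap: Optional[str], gidmap: Optional[str]) -> tuple[Optional[Mapping], Optional[Mapping]]:
    """A missing option borrows the value of the other one."""
    uid_value = uidmap if uidmap is not None else gidmap
    gid_value = gidmap if gidmap is not None else uidmap
    uid_map = parse_map_option(uid_value) if uid_value is not None else None
    gid_map = parse_map_option(gid_value) if gid_value is not None else None
    return uid_map, gid_map


def to_host_id(mapping: Mapping, container_id: int) -> Optional[int]:
    """Translate a container ID to a host ID, or None if it is outside the range."""
    container_start, host_start, count = mapping
    if container_start <= container_id < container_start + count:
        return host_start + (container_id - container_start)
    return None


uid_map, gid_map = effective_maps("0:100000:65536", None)
print(to_host_id(uid_map, 1000), to_host_id(gid_map, 1000))  # 101000 101000
```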
Rootless Default Mapping
[tomerh@fedora ~]$ cat /etc/subuid
tomerh:100000:65536
[tomerh@fedora ~]$ cat /etc/subgid
tomerh:100000:65536
[tomerh@fedora ~]$ id -u
501
[tomerh@fedora ~]$ podman run -v /tmp:/tmp debian:bookworm-nonroot touch /tmp/x
Output
[root@fedora ~]$ ls -l /tmp/x
-rw-r--r-- 1 100999 100999 0 Jan 31 23:45 /tmp/x
Huh, that’s not what I was expecting. Shouldn’t it have been 101000?
Let’s take a look at /proc/<pid>/uid_map:
[tomerh@fedora ~]$ podman run -d --name rootless debian:bookworm-nonroot sleep infinity
[tomerh@fedora ~]$ cat /proc/$(podman inspect rootless | jq '.[0].State.Pid')/uid_map
0 501 1
1 100000 65536
Podman maps our user ID to the container’s root user ID, and then maps all the other user IDs sequentially, according to our subuids and subgids. So the following mapping was used:
Container UID | Host UID |
---|---|
0 | 501 |
1 | 100000 |
2 | 100001 |
. . . | . . . |
1000 | 100999 |
. . . | . . . |
65535 | 165534 |
And that’s why the command showed 100999 instead of 101000. That still doesn’t answer why podman maps our user ID into the container. Taking a look at the documentation, the following section appears relevant:
If --userns is not set, the default value is determined as follows.
- If --pod is set, --userns is ignored and the user namespace of the pod is used.
- If the environment variable PODMAN_USERNS is set its value is used.
- If userns is specified in containers.conf this value is used.
- Otherwise, --userns=host is assumed.
userns-mode - Podman Docs[^5]
And below that there’s the following table, including all the possible values of --userns, and what mapping they use for the user’s user ID:
Key | Host UID | Container UID |
---|---|---|
auto | $UID | nil |
host | $UID | 0 |
keep-id | $UID | $UID |
keep-id:uid=200,gid=210 | $UID | 200 |
nomap | $UID | nil |
Since none of the conditions in the list apply to us, --userns=host is assumed, which means that our user ID is mapped to the root user ID, as seen in the table. If we want to change this, we’ll have to pick another mode.
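The off-by-one in this section is easy to reproduce on paper. The Python sketch below rebuilds the two uid_map lines we saw (our own user ID mapped to container root, followed by the subuid range starting at container ID 1) and translates a couple of IDs. The function names are hypothetical, and the values 501, 100000 and 65536 are simply the ones from this example.

```python
from typing import Optional

Range = tuple[int, int, int]  # (container_start, host_start, count)


def rootless_default_map(own_id: int, sub_start: int, sub_count: int) -> list[Range]:
    """Container root is our own ID; the subordinate range starts at container ID 1."""
    return [(0, own_id, 1), (1, sub_start, sub_count)]


def to_host_id(ranges: list[Range], container_id: int) -> Optional[int]:
    """Translate a container ID to a host ID, or None if it is unmapped."""
    for container_start, host_start, count in ranges:
        if container_start <= container_id < container_start + count:
            return host_start + (container_id - container_start)
    return None


mapping = rootless_default_map(501, 100000, 65536)
print(to_host_id(mapping, 0))     # 501
print(to_host_id(mapping, 1000))  # 100999
```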
Rootless Custom Mapping
[tomerh@fedora ~]$ cat /etc/subuid
tomerh:100000:65536
[tomerh@fedora ~]$ cat /etc/subgid
tomerh:100000:65536
[tomerh@fedora ~]$ id -u
501
[tomerh@fedora ~]$ podman run --uidmap=0:0:65536 -v /tmp:/tmp debian:bookworm-nonroot touch /tmp/x
Output
[root@fedora ~]$ ls -l /tmp/x
-rw-r--r-- 1 100999 100999 0 Jan 31 23:45 /tmp/x
Rather unexpectedly, using --uidmap doesn’t actually change the mapping podman uses compared to the Rootless Default Mapping, as can be seen in the uid_map:
[tomerh@fedora ~]$ podman run --uidmap=0:0:65536 -d --name rootless debian:bookworm-nonroot sleep infinity
[tomerh@fedora ~]$ cat /proc/$(podman inspect rootless | jq '.[0].State.Pid')/uid_map
0 501 1
1 100000 65535
If you have a keen eye, you may have noticed that I used --uidmap=0:0:65536 and not --uidmap=0:100000:65536. This is because in rootless mode, podman separates the user and group ID mapping into two steps, which look like this:
Container UID | Intermediate UID | Host UID |
---|---|---|
0 | 0 | 501 |
1 | 1 | 100000 |
2 | 2 | 100001 |
. . . | . . . | . . . |
1000 | 1000 | 100999 |
. . . | . . . | . . . |
65535 | 65535 | 165534 |
And these mapping steps can be controlled independently:
- --uidmap (and --gidmap), which can be used to control the mapping between the container user ID and the intermediate user ID. That is the reason we used 0:0:65536 and not 0:100000:65536, since we want to map the 0th container user ID to the 0th intermediate user ID, etc.
- --userns, which can be used to control the mapping between the intermediate user ID and the host user ID.
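The two steps compose like ordinary functions: the container ID is first translated to an intermediate ID, and the intermediate ID is then translated to a host ID. Here is a minimal Python sketch of that composition, using the ranges from this example; translate and container_to_host are illustrative names, not podman internals.

```python
from typing import Optional

Range = tuple[int, int, int]  # (source_start, target_start, count)


def translate(ranges: list[Range], source_id: int) -> Optional[int]:
    """Apply one mapping step, returning None for an unmapped ID."""
    for source_start, target_start, count in ranges:
        if source_start <= source_id < source_start + count:
            return target_start + (source_id - source_start)
    return None


def container_to_host(uidmap: list[Range], userns: list[Range], container_id: int) -> Optional[int]:
    """Container ID -> intermediate ID (--uidmap) -> host ID (--userns)."""
    intermediate = translate(uidmap, container_id)
    if intermediate is None:
        return None
    return translate(userns, intermediate)


uidmap = [(0, 0, 65536)]                    # --uidmap=0:0:65536
userns = [(0, 501, 1), (1, 100000, 65536)]  # default step for user 501 in this post

print(container_to_host(uidmap, userns, 0))     # 501
print(container_to_host(uidmap, userns, 1000))  # 100999
```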
As I mentioned in Rootless Default Mapping, --userns=host is used by default, which causes this mapping:
Intermediate UID | Host UID |
---|---|
0 | 501 |
In order to keep that from happening, we must change our --userns mode.
The table in the Rootless Default Mapping section points us in the direction of either --userns=auto or --userns=nomap:
auto: Automatically create a unique user namespace. The users range from the /etc/subuid and /etc/subgid files will be used.
nomap: Creates a user namespace where the current rootless user’s user ID and group ID are not mapped into the container.
userns-mode - Podman Docs[^5]
From their description I’d wager we want to use auto, but let’s try both and see how they’re different:
[tomerh@fedora ~]$ podman run --userns=auto -d --name rootless debian:bookworm-nonroot sleep infinity
[tomerh@fedora ~]$ cat /proc/$(podman inspect rootless | jq '.[0].State.Pid')/uid_map
0 100000 1024
[tomerh@fedora ~]$ podman run --userns=nomap -d --name rootless debian:bookworm-nonroot sleep infinity
[tomerh@fedora ~]$ cat /proc/$(podman inspect rootless | jq '.[0].State.Pid')/uid_map
0 100000 65536
The difference between --userns=nomap and --userns=auto is the default size of the mapping. While --userns=nomap uses all available subuids and subgids, --userns=auto tries to use only as much as needed. In addition, while --userns=nomap isn’t configurable, --userns=auto is. Interestingly, in our case they can be made identical by using --userns=auto:size=65536.
When checking the result of our Question with each of the above options, we can see that the results are the same:
[tomerh@fedora ~]$ podman run --rm -v /tmp:/tmp --userns=auto debian:bookworm-nonroot touch /tmp/auto
[tomerh@fedora ~]$ podman run --rm -v /tmp:/tmp --userns=auto:size=65536 debian:bookworm-nonroot touch /tmp/auto-size
[tomerh@fedora ~]$ podman run --rm -v /tmp:/tmp --userns=nomap debian:bookworm-nonroot touch /tmp/nomap
[tomerh@fedora ~]$ ls -la /tmp/{auto,auto-size,nomap}
-rw-r--r-- 1 101000 101000 0 Feb 7 13:11 /tmp/auto
-rw-r--r-- 1 101000 101000 0 Feb 7 13:11 /tmp/auto-size
-rw-r--r-- 1 101000 101000 0 Feb 7 13:11 /tmp/nomap
Conclusion
As can be seen from the numerous examples we’ve covered, even when knowing the background it is hard to predict what user and group ID will actually be used when creating files inside containers, especially so with rootless containers, due to idiosyncrasies in the various container engines.
Fortunately, it is very easy to check what mapping is used, so if you encounter any issues with user and group IDs of files on the host created by a container not matching what you expect, remember to check the uid_map/gid_map!
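When debugging from the host side, the question is usually the reverse one: "this file is owned by 100999, who is that inside the container?". A small Python sketch of that reverse lookup, assuming you pass it the path of the container process's uid_map (the helper names are made up for this post):

```python
from pathlib import Path
from typing import Optional


def read_id_map(path: str) -> list[tuple[int, int, int]]:
    """Read a uid_map/gid_map file into (ns_start, host_start, count) tuples."""
    ranges = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if len(fields) == 3:
            ns_start, host_start, count = (int(field) for field in fields)
            ranges.append((ns_start, host_start, count))
    return ranges


def to_container_id(ranges: list[tuple[int, int, int]], host_id: int) -> Optional[int]:
    """Translate an owner ID seen on the host back to the ID inside the container."""
    for ns_start, host_start, count in ranges:
        if host_start <= host_id < host_start + count:
            return ns_start + (host_id - host_start)
    return None  # this host ID is not mapped into the container


if __name__ == "__main__":
    proc_file = "/proc/self/uid_map"  # replace "self" with the container's PID
    if Path(proc_file).exists():
        print(to_container_id(read_id_map(proc_file), 100999))
```

With the rootless default mapping from earlier (0 501 1 followed by 1 100000 65536), host ID 100999 translates back to container ID 1000.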
[^1]: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_atomic_host/7/html/overview_of_containers_in_red_hat_systems/introduction_to_linux_containers#overview
[^3]: https://man7.org/linux/man-pages/man7/user_namespaces.7.html
[^4]: https://man7.org/linux/man-pages/man7/capabilities.7.html