Advancing the State of The Art of Container Storage With Titus, Part 2
Disclaimer: This blog post is a deep dive into the topic of Linux container storage, specifically looking at Netflix’s open source Titus container platform. Netflix happens to be my employer, but nothing in this blog post is secret or talks about anything that isn’t already open source.
In Part 1, I discussed the current state of the art of container storage with CSI and Kubernetes, and its limitations.
In this Part 2, I’ll discuss why mounting storage is difficult in containers, especially when user namespaces are in use.
Why can’t one just mount in a container?
If you try to mount in a docker container, with default settings, you will get permission denied:
# docker run ubuntu mount -t tmpfs none /mnt
mount: /mnt: permission denied.
Why? You are root, what other credentials do you need?
You have to be a little careful about interpreting the permission denied error.
This error (EPERM) is coming from the syscall itself, which you can verify using strace:
mount("none", "/mnt", "tmpfs", 0, NULL) = -1 EPERM (Operation not permitted)
EPERM is due to seccomp.
Seccomp is a Linux kernel feature that allows one to set a policy of which syscalls are allowed to be called.
The seccomp mechanism is fine-grained, and the default policy that docker applies only allows mount() syscalls if CAP_SYS_ADMIN is enabled.
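To check whether a seccomp policy is actually loaded for a process, one option is to read the Seccomp field the kernel exposes in /proc/<pid>/status. The following is a minimal sketch, not part of the original walkthrough; the helper name seccomp_mode is made up for illustration.

```shell
# Hypothetical helper: extract the seccomp mode from a /proc/<pid>/status
# document supplied on stdin. Kernel values: 0 = disabled, 1 = strict, 2 = filter.
seccomp_mode() {
    awk '/^Seccomp:/ { print $2 }'
}

mode=""
if [ -r /proc/self/status ]; then
    mode=$(seccomp_mode < /proc/self/status)
fi

case "$mode" in
    0) echo "seccomp: disabled" ;;
    1) echo "seccomp: strict mode" ;;
    2) echo "seccomp: filter mode (a policy is loaded)" ;;
    *) echo "seccomp: unknown (field not available)" ;;
esac
```

Run inside a container started with default settings, this would be expected to report filter mode, since that is how a seccomp policy like docker's is applied.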
So let’s enable SYS_ADMIN, which should allow us to mount given that default policy:
# docker run --cap-add SYS_ADMIN ubuntu mount -t tmpfs none /mnt
mount: /mnt: cannot mount none read-only.
Still not working?
With strace we can see that we get a different error:
mount("none", "/mnt", "tmpfs", MS_RDONLY, NULL) = -1 EACCES (Permission denied)
EACCES is coming from AppArmor.
AppArmor is yet another Linux security mechanism to do fine-grained syscall (as well as other things) access control.
The default docker AppArmor profile denies mount.
So let’s disable AppArmor and keep SYS_ADMIN:
# docker run --cap-add SYS_ADMIN --security-opt apparmor=unconfined ubuntu mount --verbose -t tmpfs none /mnt
mount: none mounted on /mnt.
#
No news is good news here. The mount worked!
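To see which AppArmor profile (if any) a process is confined by, you can read its LSM label from /proc. This is a small sketch under assumptions: the helper name apparmor_label is invented here, and on hosts using a different LSM the generic attr file may show that LSM's label instead.

```shell
# Hypothetical helper: print the confinement label of the current process.
apparmor_label() {
    # Newer kernels expose an AppArmor-specific file; the second path is the
    # generic LSM attribute file, which may belong to another LSM.
    for f in /proc/self/attr/apparmor/current /proc/self/attr/current; do
        if label=$(cat "$f" 2>/dev/null) && [ -n "$label" ]; then
            printf '%s\n' "$label"
            return 0
        fi
    done
    echo "unknown (AppArmor not enabled, or attr files unreadable)"
}

apparmor_label
```

With apparmor=unconfined, as in the command above, the label would be expected to read "unconfined".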
What about a block device?:
# docker run -ti --security-opt apparmor:unconfined --cap-add SYS_ADMIN ubuntu bash -c "dd if=/dev/zero of=/tmp/loop.img bs=1024k count=100 && mkfs.ext3 -F /tmp/loop.img && mount /tmp/loop.img /mnt/ -o loop"
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.132879 s, 789 MB/s
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done
Creating filesystem with 25600 4k blocks and 25600 inodes
Allocating group tables: done
Writing inode tables: done
Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done
mount: /mnt/: mount failed: Operation not permitted.
What now though? I thought we had all the permissions set up to mount?
In this case, using strace again reveals that this time the problem has to do with the loop devices in /dev/:
newfstatat(AT_FDCWD, "/dev/loop-control", 0x7ffe59ededb0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop", 0x7ffe59edefd0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop0", 0x7ffe59eddf40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop1", 0x7ffe59eddf40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop2", 0x7ffe59eddf40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop3", 0x7ffe59eddf40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop4", 0x7ffe59eddf40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop5", 0x7ffe59eddf40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop6", 0x7ffe59eddf40, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/dev/loop7", 0x7ffe59eddf40, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/dev/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
Any mount requiring something in
/dev/ is going to need docker’s
And you can’t just add
docker exec, it must be applied to the container at start time.
Seems like a lot of security we have to disable to make it work though.
But this is the opposite of what I want. I want more security, not less.
This is even more complex when we throw in Linux User Namespaces.
Linux user namespaces are a way of mapping the UIDs/GIDs in a container to different UIDs/GIDs on the host.
For securing containers, Titus uses user namespaces to ensure the root in the container is not root on the host.
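The mapping for any process can be inspected through /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID in the parent namespace, and the length of the range. The sketch below just pretty-prints that file; describe_uid_map is an illustrative name, not something from Titus.

```shell
# Hypothetical helper: describe each "<inside> <outside> <count>" line of a
# uid_map supplied on stdin.
describe_uid_map() {
    while read -r inside outside count; do
        echo "UIDs $inside..$((inside + count - 1)) in the namespace map to parent UIDs $outside..$((outside + count - 1))"
    done
}

if [ -r /proc/self/uid_map ]; then
    describe_uid_map < /proc/self/uid_map
fi
```

For example, a map line of "0 100000 65536" means root (UID 0) in the container is really UID 100000 outside of it.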
At the time of this writing, user namespaces are not enabled by default in Docker or Podman (running Docker with them is sometimes called “rootless docker”).
But if rootless mode is set up in docker, can we still run mount?
<docker-rootless> $ docker run --cap-add SYS_ADMIN ubuntu mount --verbose -t tmpfs none /mnt
On modern kernels this works! Interestingly, we don’t even need to disable AppArmor or Seccomp, as they are not enabled by default in rootless mode… :( I’m not exactly sure why this is the case. Titus is able to have (granted, custom) Seccomp and Apparmor policies in place on containers and have user namespaces enabled.
But user namespaces only allow certain filesystems to be mounted. Per the man page:
CAP_SYS_ADMIN within the user namespace that owns a process’s mount namespace allows that process to create bind mounts and mount the following types of filesystems:
- /proc (since Linux 3.8)
- /sys (since Linux 3.8)
- devpts (since Linux 3.9)
- tmpfs(5) (since Linux 3.9)
- ramfs (since Linux 3.9)
- mqueue (since Linux 3.9)
- bpf (since Linux 4.4)
- overlayfs (since Linux 5.11)
This is not good enough for me. I would like to be able to mount other filesystems than those.
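As a rough illustration of how short that list is, here is a sketch that hard-codes the man page's list (using the kernel's filesystem type names, for example sysfs for /sys and overlay for overlayfs) and checks a few common types against it. The function name userns_mountable is made up, and the list is only as current as the man page excerpt above; a real kernel may behave differently.

```shell
# Hypothetical helper: succeed if the filesystem type is on the man page's
# list of types mountable with CAP_SYS_ADMIN inside a user namespace.
userns_mountable() {
    case "$1" in
        proc|sysfs|devpts|tmpfs|ramfs|mqueue|bpf|overlay) return 0 ;;
        *) return 1 ;;
    esac
}

for fs in tmpfs overlay ext4 xfs nfs; do
    if userns_mountable "$fs"; then
        echo "$fs: on the list"
    else
        echo "$fs: not on the list"
    fi
done
```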
But what if we are OK with running a command as root (real root) outside of the container to mount storage inside the container?
Why can’t we just “run mount” for a container, even if it is rootless or has no additional privileges?
Injecting Mounts in Containers
You can run mount on behalf of a container!
What makes the difference between mounting something inside or outside a container? How does Linux “know” that something is mounted in a container or not? The answer: “The Linux mount namespace”.
If you can enter the mount namespace first, then mount, the mount will “be” in the container’s mount namespace:
|Outside the container||Inside the container|
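One way to see which mount namespace a process lives in is to compare the namespace links under /proc. The sketch below is an assumption-laden illustration (helper names mnt_ns_of and same_mnt_ns are invented); the nsenter line in the comment shows the general shape of entering a container's mount namespace as real root, with CONTAINER_PID standing in for the PID of a process inside the container.

```shell
# Hypothetical helpers for comparing mount namespaces of two processes.
mnt_ns_of() {
    # Prints something like "mnt:[4026531841]"; fails if unreadable.
    readlink "/proc/$1/ns/mnt"
}

same_mnt_ns() {
    a=$(mnt_ns_of "$1") || return 2
    b=$(mnt_ns_of "$2") || return 2
    [ "$a" = "$b" ]
}

if same_mnt_ns "$$" 1 2>/dev/null; then
    echo "this shell shares a mount namespace with PID 1"
else
    echo "this shell is in a different mount namespace than PID 1, or PID 1's namespace is not readable"
fi

# As real root on the host, the general shape of the trick would be:
#   nsenter --target "$CONTAINER_PID" --mount mount -t tmpfs none /mnt
```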
While this looks like a good lead, it will not work for block devices in /dev.
Because as soon as you change into the container’s mount namespace, you can no longer see the host’s /dev.
Even if we could see dev files, or fuse, or network filesystems, the mount wouldn’t work if we had user namespaces on too.
Why? Just because you are (fake) root in a container and have capabilities does not mean they are enough to allow a mount (with the exception of those filesystems listed above).
The Problem/Security of User Namespaces: Namespaced Capabilities
Let’s take docker out of the picture and just use raw Linux namespaces. They are simpler to use (even if less familiar) and demonstrate things more easily.
Let’s just get root:
$ whoami
kyle
$ unshare --user --fork --map-root-user
# whoami
root
# capsh --print
Current: =ep
...
Using user namespaces, it looks like I’m root in my micro-container, and have full capabilities! (=ep means all capabilities)
Can I mount block devices now???
$ unshare --user --fork --map-root-user mount /dev/sda1 /mnt/
mount: /mnt: permission denied.
No. “Of course” I cannot. This is not a limitation of docker (not in use), has nothing to do with seccomp, apparmor, or even file permissions!
In fact it is kinda the opposite: I’m in my own “container”, but the only thing I’ve containerized is UIDs.
You can even chmod 777 /dev/sda1 and it won’t make a difference!
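The capability set that capsh reports can also be read directly from the CapEff bitmask in /proc/self/status. The sketch below tests a single bit of that mask; has_cap is an illustrative helper, and 21 is the bit number of CAP_SYS_ADMIN. Keep in mind that a set bit only means the capability is held relative to the process's own user namespace, which is exactly why the mount above still fails.

```shell
# Hypothetical helper: succeed if bit $2 is set in the hex capability mask $1
# (the format used by the CapEff line in /proc/<pid>/status).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_SYS_ADMIN=21

cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null) || cap_eff=""

if has_cap "${cap_eff:-0}" "$CAP_SYS_ADMIN"; then
    echo "CAP_SYS_ADMIN is in the effective set (within this user namespace)"
else
    echo "CAP_SYS_ADMIN is not in the effective set"
fi
```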
But could we pull the same nsenter trick on a running container?
|Outside the container||Inside the container|
It didn’t work when we tried to nsenter with user namespaces.
It did work when we omitted user namespaces, but then the container couldn’t
Why couldn’t we “just” mount it on the left hand side with
We have real root on the left hand side, why didn’t it work?
User namespaces, Capabilities, and filesystems
The real reason that mounting in user namespaced containers works sometimes, but not others, is that only certain filesystems are
For example, only starting in kernel version 5.11 was overlayfs audited and blessed to be mounted in an unprivileged (user namespaced) container.
Still, we are talking only a handful of filesystems that have this capability.
Is there any way to mount storage in an unprivileged container??? Yes, with some tricks.
In Part 3 I’ll discuss how Titus (titus-storage) is able to separate the attaching of storage from the container lifecycle (how to attach storage after a container is running), all while respecting all four Linux namespaces, and while keeping the container completely unprivileged.
[ Part 1 | Part 2 | Part 3 | Part 4 ]