Advancing the State of The Art of Container Storage With Titus, Part 4

1 May, 2022

Disclaimer: This blog post is a deep dive into the topic of Linux container storage, specifically looking at Netflix’s Open Source Titus container platform. Netflix happens to be my employer, but nothing in this blog post is secret or talks about anything that isn’t already open source.

In Part 1, I discussed the current state of the art of container storage with CSI + Kubernetes, and its limitations.

In Part 2, I looked at the problem of mounting storage inside running containers, especially when using user namespaces.

In Part 3, I talked about the mount binaries Titus uses to mount storage in containers at runtime using the new Linux mount APIs and SCM_RIGHTS.

In this Part 4, I’ll talk about how we are able to do all that, and still ensure that all the storage is ready by the time a user’s workload starts up.

Getting Storage Right, Before a Workload Starts

One consequence of building storage primitives that depend on the container’s namespaces already existing is that we have to mount the storage after the PID1 of the container exists.

How can we ensure that the storage is ready first? Easy: Control the PID1! Analogous to Fly.io’s init, Titus also injects its own PID1 into every container.

The Titus PID1 (titus-init) just starts up, and then waits for the signal to go. The orchestrating process (titus-executor) can do things like set up storage while the PID1 is waiting:

sequenceDiagram
    participant TE as titus-executor
    participant TS as titus-storage
    participant C as container
    participant T as titus-init
    participant U as user process
    activate TE
    TE->TE: Open Unix Domain Socket
    rect rgb(191, 223, 255)
    Note over T,C: titus-init (tini) is pid1 of the container
    activate C
    activate T
    T->>TE: Connect to socket, wait for instruction
    TE->>TS: Start storage
    activate TS
    TS->>C: Mount storage in the container
    TS->>TE: Storage complete
    deactivate TS
    TE->TE: Set up logging, auxiliary services, inject sidecars
    TE->>T: Launch the workload! (over socket)
    TE->TE: Close Unix Domain Socket
    T->>U: exec the original command
    end
    deactivate T
    activate U
    U->U: Start
    deactivate U
    deactivate C
    deactivate TE
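
To make the handshake in the diagram concrete, here is a minimal sketch in Python of both sides of the socket. It is not the real titus-init or titus-executor code. The `GO_MESSAGE` wire format, the function names, and the `prepare` callback are all assumptions made for illustration; the only parts taken from the flow above are "connect, wait, then exec".

```python
import os
import socket

# Hypothetical wire format; this post does not specify what is sent on the socket.
GO_MESSAGE = b"go\n"


def wait_for_launch(socket_path: str) -> None:
    """PID1 side: connect to the executor and block until told to launch."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        with sock.makefile("rb") as reader:
            message = reader.readline()  # blocks while storage is being set up
    if message != GO_MESSAGE:
        raise RuntimeError(f"no launch signal received, got {message!r}")


def run_executor(server: socket.socket, prepare) -> None:
    """Executor side: accept the PID1 connection, do the setup, then release it."""
    conn, _ = server.accept()
    with conn:
        prepare()  # stand-in for: mount storage, set up logging, inject sidecars
        conn.sendall(GO_MESSAGE)


def pid1_main(socket_path: str, argv: list) -> None:
    """What a waiting PID1 would do: block, then become the user's command."""
    wait_for_launch(socket_path)
    os.execvp(argv[0], argv)  # replaces this process; the PID stays the same
```

Because the last step is an exec and not a fork, the user's command inherits the PID and everything that was set up while PID1 was parked, including the mounts.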

Enabling Storage and Other System Services

With the workload paused, Titus can launch much more than just storage.

See this KubeCon talk on how Titus injects lightweight sidecars into containers, without adjusting the pod spec.

Handling Stdout/Stderr

Handling the stdout/stderr of a container process is one of the first quality-of-life improvements that a container platform can offer.

Life is just so much better when you don’t have to fight for the stdout/stderr of a crashed container!

k8s doesn’t exactly make it super easy.

Having control of PID1 of the container means we can easily handle the stdout/stderr file descriptors and ensure the logs are handled “out of band”, and are never lost.

This is easily accomplished by simply dup2'ing the stdout/stderr file descriptors onto real files that can outlive the container.
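
As a rough illustration of the dup2 trick, here is a small Python sketch. The file names and the `log_dir` layout are made up for the example, and Titus itself does this in its own init, not in Python.

```python
import os


def redirect_fd_to_file(target_fd: int, log_path: str) -> None:
    """Make target_fd refer to log_path (append mode), so writes land in the file."""
    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.dup2(log_fd, target_fd)  # target_fd now points at the log file
    finally:
        os.close(log_fd)  # the duplicate keeps the file open


def capture_stdio(log_dir: str) -> None:
    """Point stdout (fd 1) and stderr (fd 2) at files under log_dir.

    The "stdout" and "stderr" file names are hypothetical.
    """
    redirect_fd_to_file(1, os.path.join(log_dir, "stdout"))
    redirect_fd_to_file(2, os.path.join(log_dir, "stderr"))
```

If PID1 calls something like `capture_stdio` before it execs the workload, the workload inherits fds 1 and 2 already pointing at the files, and never has to know about it.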

Enabling systemd

Supporting systemd is a killer feature for Titus. systemd requires a lot of additional configuration to work in a container; luckily, our Titus users do not need to know about that. But systemd must also be PID1 when it runs. For Titus this is easy: if we detect that systemd will be running, titus-init simply execs into systemd, allowing it to take over PID1 duties. Otherwise, titus-init stays in place, providing a true PID1-compliant process, so that Titus users don’t need to provide their own.
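
The two branches can be sketched roughly like this in Python. How titus-init actually detects systemd is not covered in this post, so the basename check in `wants_systemd` is purely an assumption, as are the function names. The "stay in place" branch shows the main job of a PID1: reaping every child, including orphans that get re-parented to it.

```python
import os

# Assumption for illustration only: treat these command names as "systemd".
SYSTEMD_NAMES = {"systemd", "init"}


def wants_systemd(argv: list) -> bool:
    """Hypothetical detection: does the workload's command look like systemd?"""
    return bool(argv) and os.path.basename(argv[0]) in SYSTEMD_NAMES


def supervise(argv: list) -> int:
    """Stay as PID1: start the workload, reap all children, return its exit code."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if the exec itself failed
    while True:
        pid, status = os.waitpid(-1, 0)  # reaps any child, orphans included
        if pid == child:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return 128 + os.WTERMSIG(status)


def launch(argv: list) -> int:
    """Either hand PID1 over to systemd, or keep it and supervise the workload."""
    if wants_systemd(argv):
        os.execvp(argv[0], argv)  # systemd takes over this PID; never returns
    return supervise(argv)
```

A real init also forwards signals to the workload, which this sketch leaves out.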

Enabling Custom Seccomp Policies

Having control over PID1 in the container also means we can install very special Seccomp policies that will apply to every future process in that container, because seccomp filters are inherited by child processes. But we already have Seccomp and AppArmor, so why would we want even more? In this case, we are not using Seccomp to restrict syscalls. We are using the newer seccomp-notify mechanism to enable additional syscalls by having them intercepted by a supervisor, where they can be handled safely in user space. This is for things like the perf or bpf syscalls, which would otherwise be too powerful to give to a container. Eventually such configuration will be part of the OCI spec.
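
To give a feel for the supervisor side, here is a toy model in Python. It does not talk to the kernel at all: `Notification` and `Response` are simplified stand-ins for the kernel's `struct seccomp_notif` and `struct seccomp_notif_resp`, and the "only profile yourself" policy is a hypothetical example, not the Titus policy. It only shows the shape of the decision: a paused syscall comes in, and the supervisor answers with either a return value or an errno.

```python
import errno
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Notification:
    """Simplified stand-in for struct seccomp_notif."""
    id: int
    pid: int
    syscall: str  # the kernel reports a number; a name keeps the sketch readable
    args: Tuple[int, ...]


@dataclass(frozen=True)
class Response:
    """Simplified stand-in for struct seccomp_notif_resp."""
    id: int
    error: int = 0  # 0 on success, negative errno on failure
    val: int = 0    # return value handed back to the paused syscall


Handler = Callable[[Notification], int]


def self_only(perform: Handler) -> Handler:
    """Hypothetical policy for perf_event_open: the pid argument must be 0 (self)."""
    def handler(notif: Notification) -> int:
        if notif.args[1] != 0:
            raise PermissionError("may only profile the calling process")
        return perform(notif)  # privileged work done on the container's behalf
    return handler


def dispatch(notif: Notification, handlers: Dict[str, Handler]) -> Response:
    """Decide how to answer one intercepted syscall."""
    handler = handlers.get(notif.syscall)
    if handler is None:
        return Response(notif.id, error=-errno.ENOSYS)
    try:
        return Response(notif.id, val=handler(notif))
    except PermissionError:
        return Response(notif.id, error=-errno.EPERM)
```

In a real supervisor the notifications arrive over the seccomp listener file descriptor and the answers go back the same way; the point here is just that the policy ends up as ordinary user-space code.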

Conclusion

Having control over the PID1 of every container on a container platform is a huge point of technical leverage. I would highly encourage anyone building a container platform to invest in this control. It is useful for way more than just “pausing” the workload for storage purposes. It also can enable fine-grained control over process order, the Russian Doll technique.

Series Conclusion

Titus is engineered to have its cake and eat it too:

Advanced storage, beyond what the CSI can support
User namespaces, Seccomp, and AppArmor: container security for multi-tenant environments that goes above and beyond what kubelet can support
Fine-grained container ordering, systemd containers, and PID1 injection

These are hard but not impossible requirements to meet, and the solutions are all open source.

I look forward to continuing to be on the cutting edge of container technology as we aim to meet new challenges.
