Short Note About Container Escapes, Namespaces, and Filesystem Roots

h0mbre recently published a very detailed writeup about his attempt to exploit an n-day in Google’s kernelCTF. As part of this, the exploit has to escape the container environment.

Container environments are typically based on Linux namespaces. Each namespace can present a different view of OS resources; in particular, mount namespaces provide a different view of the filesystem. Container solutions use this behaviour to give the impression of a separate filesystem root while it actually is a subdirectory on the host’s filesystem (compare chroot(1)).
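To make the "different view" idea a bit more tangible, here is a minimal sketch of how a process can inspect which mount namespace it lives in. It reads the procfs namespace links; the helper names read_ns_link and same_namespace are made up for this note and are not part of any library.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Read the target of a procfs namespace link such as /proc/self/ns/mnt.
 * On success buf holds a string of the form "mnt:[<inode>]" and 0 is
 * returned. On failure buf is set to "" and -1 is returned.
 */
int read_ns_link(const char *path, char *buf, size_t len)
{
    if (len == 0)
        return -1;

    ssize_t n = readlink(path, buf, len - 1);
    if (n < 0) {
        buf[0] = '\0';
        return -1;
    }
    buf[n] = '\0'; /* readlink(2) does not NUL-terminate */
    return 0;
}

/* Returns 1 if both links name the same namespace, 0 if not, -1 on error. */
int same_namespace(const char *path_a, const char *path_b)
{
    char a[64], b[64];

    if (read_ns_link(path_a, a, sizeof a) != 0 ||
        read_ns_link(path_b, b, sizeof b) != 0)
        return -1;
    return strcmp(a, b) == 0;
}
```

Calling same_namespace("/proc/self/ns/mnt", "/proc/1/ns/mnt") would tell whether the caller shares the mount namespace of PID 1 as seen from its own PID namespace. Reading another process's link is subject to permission checks, so an unprivileged caller may simply get -1.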

h0mbre noted at the end of his writeup that he briefly struggled to read the actual flag from the host, even though the exploit had escalated privileges perfectly and the nsproxy member of the exploit’s task_struct had been successfully overwritten. The exploit still had the filesystem root of the container. The reason for this is that the exploit still lives in the container’s namespace. h0mbre therefore opted to call setns(2) in order to change the mount namespace to that of the host’s init process, which changes the filesystem root to that of the host.
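The userland side of that step can be sketched roughly as follows. This is not h0mbre's code, only a minimal illustration of the setns(2) call; the function name enter_mount_namespace is hypothetical, and which procfs link to pass depends on the state the exploit has left the task in.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/*
 * Join the mount namespace referred to by ns_path, which is expected to be
 * a procfs namespace link. Returns 0 on success, -1 on failure with errno
 * set by open(2) or setns(2).
 */
int enter_mount_namespace(const char *ns_path)
{
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* CLONE_NEWNS makes setns(2) reject an fd that is not a mount namespace. */
    int rc = setns(fd, CLONE_NEWNS);

    /* Close the fd without losing the errno value from setns(2). */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;

    return rc;
}
```

In a normal, unprivileged process this call fails; it only does something useful once the exploit has already obtained the required privileges and access to a link that resolves to the host's mount namespace.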

I wrote about the very same topic myself in the blog of my current employer. However, I chose to overwrite the fs member of the exploit’s task_struct instead. That worked out as well.

The reason why both solutions work equivalently can be found in the function mntns_install. It is called by the setns(2) syscall, which sets the namespaces of a thread according to the file descriptor given as the first argument; that descriptor refers to one of the procfs namespace symbolic links. mntns_install in turn calls the functions set_fs_pwd and set_fs_root, which set the current directory and the filesystem root according to the linked namespace. Because the cred and nsproxy members of the exploit’s task_struct are already overwritten, using the /proc/self/ns/mnt link for the mount namespace will set the mount namespace to the one of the host’s init process and accordingly set the fs object to have the host’s filesystem root.

Simply overwriting the fs member in task_struct has just the same effect.
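To illustrate why the two routes end up in the same place, here is a toy userspace model. It is not kernel code: the struct names mirror the kernel's, but the layouts are invented for this note, and refcounting, locking and permission checks are left out entirely. All model_* functions are hypothetical.

```c
/* Toy stand-ins for kernel objects; names mirror the kernel, layouts do not. */
struct path      { const char *name; };
struct mnt_ns    { struct path root; };
struct fs_struct { struct path root, pwd; };
struct nsproxy   { struct mnt_ns *mnt_ns; };
struct task      { struct nsproxy *nsproxy; struct fs_struct *fs; };

/* Route 1: what the setns(2) path does to the caller, heavily simplified. */
void model_mntns_install(struct task *t, struct mnt_ns *target)
{
    t->nsproxy->mnt_ns = target;  /* switch the mount namespace */
    t->fs->pwd  = target->root;   /* corresponds to set_fs_pwd()  */
    t->fs->root = target->root;   /* corresponds to set_fs_root() */
}

/* Route 2: point the task at an fs_struct that already has the host root. */
void model_overwrite_fs(struct task *t, struct fs_struct *host_fs)
{
    t->fs = host_fs;
}

/* The root that absolute path lookups would start from for this task. */
const char *model_root_of(const struct task *t)
{
    return t->fs->root.name;
}
```

In this model, a task whose fs root is "container:/" reports the host root from model_root_of after either route. The difference is only in how it gets there: route 1 updates the existing fs object in place, route 2 swaps the pointer.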

Which solution is better probably depends on the exploit and the available primitives. In a ROP chain scenario, using setns(2) is probably easier, as the exploit can just jump to the function instead of collecting a lot of ROP gadgets to overwrite fs in task_struct. On the other hand, in one PoC exploit I experimented with, the arbitrary write primitive turned out to be unreliable for overwriting complete pointers. Depending on the primitive, the signal handlers after nsproxy could be corrupted, too, again resulting in a very unstable exploit (h0mbre described that particular case as well). Just overwriting fs turned out to be much more reliable for me in those cases.