30.07.2021

The condemned live longer: Symlink races

Symlink race vulnerabilities are a common finding in our code audits. One might think that such bugs no longer happen and that every developer can easily avoid them; sadly, reality is different. In this blog post, we take a closer look at a symlink race vulnerability in Docker from 2018. We find it interesting because it is easy to exploit but not easy to spot during review. Attentive readers may ask themselves whether they would have noticed the issue while developing or reviewing the affected lines of code.

Due to their time-dependent nature, symlink races are not always easy to exploit and therefore remain underrated. Sometimes, however, they are exploitable with surprisingly little effort. One such bug is the Docker vulnerability CVE-2018-15664.

Overview

Docker offers a way to copy files from/to a container. The basic syntax of the copy tool is:

docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH|-
docker cp [OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH

So to update the container’s /etc/passwd file, a system administrator can execute a command such as:

docker cp new_passwd ea7d702924bc:/etc/passwd

Here, ea7d702924bc is the container ID.

Internally, docker cp hands the operation over to the Docker daemon, which runs as root. The daemon resolves the container path to the actual host path in the Docker storage. For example, ea7d702924bc:/etc/passwd would resolve to /var/lib/docker/btrfs/subvolumes/6383eed3/etc/passwd.

Here, /var/lib/docker/btrfs/subvolumes/6383eed3 is the root directory from the container’s point of view. The path differs depending on the Docker storage type; this example uses btrfs subvolumes. For readability, we refer to /var/lib/docker/btrfs/subvolumes/6383eed3 as BASEDIR from now on.

Things get interesting when symlinks are involved. Let’s assume that within the container, /etc is a symlink pointing to /. The Docker daemon will determine that the last path component of BASEDIR/etc is a symlink. Simply following the symlink is not an option: it would lead to /, which in this context is the root of the host filesystem. Instead, Docker reads the target of the symlink and applies it relative to the base directory of the container. The final path is BASEDIR/.

Therefore the passwd file will land in the container root directly instead of /etc/. This is expected and nothing to worry about.
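The scoping rule can be illustrated with a small Python sketch. It is not Docker’s actual implementation: the function name resolve_in_scope and the max_links limit are made up for this illustration. The sketch walks the path one component at a time and, whenever a component is a symlink, interprets the link target relative to the container root instead of the host root.

```python
import os


def resolve_in_scope(base, path, max_links=40):
    """Resolve `path` as seen from inside `base`, never leaving `base`.

    Hypothetical helper for illustration only; not Docker's real code.
    """
    base = os.path.realpath(base)
    pending = [c for c in path.split("/") if c not in ("", ".")]
    resolved = []  # components below `base` that are already resolved
    links_seen = 0
    while pending:
        comp = pending.pop(0)
        if comp == "..":
            # Never climb above the container root.
            if resolved:
                resolved.pop()
            continue
        candidate = os.path.join(base, *resolved, comp)
        if os.path.islink(candidate):
            links_seen += 1
            if links_seen > max_links:
                raise OSError("too many levels of symbolic links")
            target = os.readlink(candidate)
            if target.startswith("/"):
                # Absolute target: restart at the container root.
                resolved = []
            # Relative target: continue from the link's parent directory.
            pending = [c for c in target.split("/") if c not in ("", ".")] + pending
        else:
            resolved.append(comp)
    return os.path.join(base, *resolved)
```

With etc being a symlink to /, resolve_in_scope(BASEDIR, "/etc/passwd") yields BASEDIR/passwd, which matches the behavior described above.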

Under the hood

To ensure that all file operations run within the container filesystem, docker sets up a new mount namespace. The interesting steps are:

  1. Resolving the container path, including symlink check and sanitizing
  2. Creating a new process with a new mount namespace
  3. Sanitizing the new mount namespace with pivot_root() so that the root directory is within the container root
  4. Performing the file operation relative to /.

So, a simplified sequence of executed system calls could be:

[pid  1724] lstat("BASEDIR/etc", {st_mode=S_IFLNK|0777, st_size=1, ...}) = 0
[pid  1724] readlinkat(AT_FDCWD, "BASEDIR/etc", "/", 128) = 1
[pid  1724] lstat("BASEDIR", {st_mode=S_IFDIR|0755, st_size=206, ...}) = 0
[pid  1724] fork() = 21681
[pid 21681] unshare(CLONE_NEWNS)
[pid 21681] mkdirat(AT_FDCWD, "BASEDIR/.pivot_root918928353", 0700) = 0
[pid 21681] pivot_root("BASEDIR", "BASEDIR/.pivot_root918928353") = 0
[pid 21681] chdir("/")                  = 0
[pid 21681] mount("", "/.pivot_root918928353", 0xc420246fd3, MS_REC|MS_PRIVATE, NULL) = 0
[pid 21681] umount2("/.pivot_root918928353", MNT_DETACH) = 0
[pid 21681] unlinkat(AT_FDCWD, "/.pivot_root918928353", AT_REMOVEDIR) = 0
[pid 21681] openat(AT_FDCWD, "/passwd", O_WRONLY|O_CREAT|O_CLOEXEC, 0644) = 6

We observe that Docker assumes that the sanitized path cannot change while it works on it. This assumption is the root of the problem. Consider the case where /etc within the container is not a symlink at the time of the check, but is replaced by a symlink pointing to / right before mkdirat().

In that case, the system call sequence could look like this:

[pid 15509] lstat("BASEDIR/etc", {st_mode=S_IFDIR|0755, st_size=18, ...}) = 0
[pid 15509] fork() = 21630
[pid 21630] unshare(CLONE_NEWNS)
[pid 21630] mkdirat(AT_FDCWD, "BASEDIR/etc/.pivot_root447772558", 0700) = 0
[pid 21630] pivot_root("BASEDIR/etc/", "BASEDIR/etc/.pivot_root447772558") = 0
[pid 21630] chdir("/")                  = 0
[pid 21630] mount("", "/.pivot_root447772558", 0xc4208981ed, MS_REC|MS_PRIVATE, NULL) = 0
[pid 21630] umount2("/.pivot_root447772558", MNT_DETACH) = 0
[pid 21630] unlinkat(AT_FDCWD, "/.pivot_root447772558", AT_REMOVEDIR) = 0
[pid 21630] openat(AT_FDCWD, "/passwd", O_WRONLY|O_CREAT|O_CLOEXEC, 0644) = 6

If etc is suddenly a symlink to /, the kernel will resolve BASEDIR/etc/ to / in mkdirat() and pivot_root(). The same goes for BASEDIR/etc/.pivot_root447772558, which turns into /.pivot_root447772558. This makes the pivot_root() operation basically a no-op, since the host root filesystem is switched to /.pivot_root447772558 and back again.

As soon as Docker opens /passwd inside the new mount namespace, it opens the file on the host filesystem instead of in the root of the targeted container.
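The gap between check and use can be reproduced without Docker. The following Python sketch is a deterministic stand-in for the race: instead of a concurrent attacker, it replaces the directory with a symlink between the check and the use. The directories host and container live in a scratch location and are assumptions of this example, as is the function name demo_check_then_use.

```python
import os
import tempfile


def demo_check_then_use():
    scratch = os.path.realpath(tempfile.mkdtemp())
    host = os.path.join(scratch, "host")            # stands in for the host root
    container = os.path.join(scratch, "container")  # stands in for BASEDIR
    etc = os.path.join(container, "etc")
    os.makedirs(host)
    os.makedirs(etc)

    # Time of check: etc is a plain directory, so the path looks safe.
    checked_as_dir = os.path.isdir(etc) and not os.path.islink(etc)

    # The attacker acts after the check: etc becomes a symlink to the "host".
    os.rename(etc, etc + ".dir")
    os.symlink(host, etc)

    # Time of use: the kernel follows the symlink when the path is used.
    with open(os.path.join(etc, "passwd"), "w") as f:
        f.write("new passwd contents\n")

    escaped = os.path.exists(os.path.join(host, "passwd"))
    return checked_as_dir, escaped
```

The check passes, yet the file ends up in the directory that stands in for the host root. In the real attack the swap happens concurrently, so it only sometimes lands between check and use.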

Exploiting the problem

In the attack scenario, the attacker controls a container and can execute code inside it. They know that the system administrator uses docker cp, either manually or through automation, to copy a file into the container.

We saw that Docker assumes that the path cannot change while the copy operation runs, and that we gain access to the host filesystem if we trick Docker by presenting a directory at check time and a symlink at use time. Hitting exactly the right moments to present either a directory or a symlink might sound hard to achieve. There is a simpler approach: a brute force attack. We don’t care if most attempts fail, as long as one succeeds.

So we create two filesystem entries: /etc, a directory, and /etc.sym, a symlink to /. For the brute force attack, the names /etc and /etc.sym are exchanged over and over again. The exchange has to be atomic so that at any point in time /etc is either a directory or a symlink to /. To achieve that, we use Linux’s renameat2() system call.

The exploit basically is a program performing

renameat2(AT_FDCWD, "/etc", AT_FDCWD, "/etc.sym", RENAME_EXCHANGE);

in a loop.

The crucial points in time are:

  • T1: the time of the path being checked
  • T2: the point when mkdirat() runs to create the pivot directory
  • T3: when pivot_root() takes place

/etc within the container can be in one of two states: D (a directory) or S (a symlink). With three points in time and two possible states each, there are 2^3 = 8 possible attack states our brute force exploit can trigger. We already know that the attack only succeeds if /etc is a directory at check time (T1) and a symlink at both T2 and T3. Therefore only attack state A4 will succeed.

This rough estimate suggests that one-eighth of all attempts succeed, assuming all eight states are equally likely. Not too bad for a brute force attack. Please note that this is only an estimate that leaves out some details. For example, the time between T1 and T2 is comparatively long since execution spans two processes, while the time between T2 and T3 is minimal since both calls run back to back in the same process. So a swap is much more likely to land between T1 and T2 than between T2 and T3.

Let’s take a closer look at the states in which the attack is not successful. We distinguish two types of failures:

  • unnoticed failures
  • noticed failures

For example, if etc is a symlink at the time of mkdirat() and a directory again at the time of pivot_root(), pivot_root() will fail and docker cp reports an error.

         A1  A2  A3  A4  A5  A6  A7  A8
T1       D   D   D   D   S   S   S   S
T2       D   D   S   S   S   S   D   D
T3       D   S   D   S   S   D   S   D
Result   U   E   E   X   U   U   U   U

D ... etc is a directory
S ... etc is a symlink

U ... attack unsuccessful, unnoticed
E ... attack unsuccessful, errors reported
X ... exploited!
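The table can be reproduced by enumerating all combinations. The classification rule in this Python sketch is inferred from the table itself, and the one-eighth figure rests on the simplifying assumption that all eight states are equally likely. The names classify, outcomes and success_rate are chosen for this example.

```python
from itertools import product


def classify(t1, t2, t3):
    """Outcome of one attempt; 'D' = directory, 'S' = symlink."""
    if t1 == "S":
        # The symlink is seen at check time and resolved inside the container.
        return "U"
    if t2 == "S" and t3 == "S":
        return "X"  # directory at check time, symlink at both later steps
    if t2 == "D" and t3 == "D":
        return "U"  # nothing changed, ordinary copy into the container
    return "E"      # mkdirat() and pivot_root() see different things


# Map each state such as "DSS" to its outcome.
outcomes = {"".join(s): classify(*s) for s in product("DS", repeat=3)}
success_rate = sum(o == "X" for o in outcomes.values()) / len(outcomes)
```

Only the state DSS (A4 in the table) is classified as X, which gives the success rate of 1/8 under the equal-likelihood assumption.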

Impact

Under optimal circumstances, an attacker within a container can overwrite any file on the host. On the other hand, the attacker needs to know which file is being written using docker cp, and if the copy operation runs only rarely, the chances of hitting the race window are low.

Identifying the attack

In a previous section, we saw that in some cases docker cp fails and reports an error. If docker cp runs as part of your automation, check the logs for failures, as they could indicate an attack. Another sign of an ongoing attack is stale .pivot_rootXXXXXXXX directories on the host, where XXXXXXXX is a number. If the attacker tries to place a file in the host root filesystem, unsuccessful attempts leave a .pivot_rootXXXXXXXX directory behind.
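A check for such leftovers could look like the following Python sketch. The function name find_stale_pivot_dirs is hypothetical, and the pattern assumes that the suffix consists of digits only, as in the traces above. The sketch looks only at the direct children of the given directory.

```python
import os
import re

# Assumption: the suffix is purely numeric, as in the strace output.
PIVOT_DIR = re.compile(r"\.pivot_root\d+")


def find_stale_pivot_dirs(root="/"):
    """Return leftover .pivot_root<digits> directories directly below `root`."""
    hits = []
    for entry in os.scandir(root):
        # Match the whole name and ignore symlinks and regular files.
        if PIVOT_DIR.fullmatch(entry.name) and entry.is_dir(follow_symlinks=False):
            hits.append(entry.path)
    return sorted(hits)
```

Calling find_stale_pivot_dirs("/") on the host lists candidate directories. A hit is a hint worth investigating, not proof of an attack.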

Fixing the problem

The currently implemented solution is suspending the container while the copy operation runs. With that, the docker daemon can be sure that nothing inside the container can change paths while it performs the copy operation. The downside of this approach is that suspending a container is slow and will interrupt workloads.

Final thoughts

Symlink races, or more generally TOCTOU (Time-of-Check-to-Time-of-Use) issues, are still a problem and not always obvious. To avoid them while writing code, always ask yourself whether a system resource can change after you have checked it. Usually the answer is yes: it can change whenever the check is performed by your code and not by the operating system. This holds especially for checks on filesystem resources such as symlinks. If you are using a high-level programming language, try to understand which operations it performs at the system call level. Just because something is a single operation in your favorite programming language does not mean it is atomic.
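One general way to narrow such a window is to let the operating system perform the check as part of the open call and then to work with file descriptors instead of paths. The Python sketch below illustrates this idea; it is not how Docker fixed CVE-2018-15664, and the function names open_dir_nofollow and create_at are made up for this example. Note that O_NOFOLLOW only covers the last path component, so symlinks in earlier components are still followed.

```python
import os


def open_dir_nofollow(path):
    """Open `path` as a directory; fail if its last component is a symlink."""
    flags = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW
    return os.open(path, flags)


def create_at(dir_fd, name, data):
    """Create `name` inside the directory that `dir_fd` refers to."""
    # The descriptor, not the path, identifies the directory from here on,
    # so renaming or replacing the path afterwards does not redirect us.
    flags = os.O_WRONLY | os.O_CREAT | os.O_NOFOLLOW
    fd = os.open(name, flags, 0o644, dir_fd=dir_fd)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
```

A caller would open the directory once with open_dir_nofollow(), pass the descriptor to create_at() for every file operation, and close it with os.close() afterwards. If the directory is swapped for a symlink after the open, the file is still created in the directory that was originally opened.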
