Databricks became aware of a new critical runc vulnerability (CVE-2019-5736) on February 12, 2019 that allows malicious container users to gain root access to the host operating system. This vulnerability affects many container runtimes, including Docker and LXC. The Databricks security team has evaluated the vulnerability and confirmed that, due to the Databricks platform architecture, there is no external vector by which an attacker could exploit the flaw to gain access to the host VM on which the containers reside. Additionally, our architecture isolates each customer by providing each customer with a separate host VM located within the customer’s cloud services account, so this exploit would not permit any cross-customer access, even if the underlying container were compromised.
This CVE includes two attack vectors:
- Creating a new container using an attacker-controlled image.
Databricks only launches containers built by the Databricks engineering team, so malicious external users have no way of launching their own image.
- Attaching to an existing container to which the attacker previously had write access.
Only Databricks services can attach to existing containers. Users access containers through RPCs, and cannot attach to existing containers.
Though we believe the vulnerability is unlikely to be practically exploitable in our environment, Databricks engineering will push a hotfix that will be deployed as soon as reasonably possible.
How does the exploit work in detail?
The exploit tries to compromise the container runtime binary to gain root access to the host. The container runtime is a binary program that runs on the host system and orchestrates process execution inside the container. It is designed to ensure that the container’s processes run in their own isolated namespaces and with reduced privileges. On Docker, the default container runtime is the runc binary; on LXC, it is the set of lxc-* utilities.
Taking lxc-attach as an example, a malicious user can mount the attack with the following steps:
- Replace a target binary inside the container with custom content that points back to the lxc-attach binary itself. For example, one can replace the container’s /bin/bash with the following content:
<injected malicious payload goes here>
In this way, /bin/bash (container path) becomes an executable script using /proc/self/exe to interpret its malicious content. Note that /proc/self/exe is a symbolic link created by the kernel for every process which points to the binary that was executed for that process.
- Trick the container runtime into executing the target binary from the host system. When /bin/bash is executed inside the container, the kernel instead executes the target of /proc/self/exe, which points to the container runtime binary on the host. In this example, when the attacker uses lxc-attach to run a command inside the container, lxc-attach invokes the container’s /bin/bash using the execve() syscall, which in turn runs /proc/self/exe (that is, lxc-attach itself) to interpret the injected malicious payload.
- Proceed to write to the target of /proc/self/exe so as to overwrite the lxc-attach binary on the host. In general, however, this will not succeed, because the kernel does not permit the binary to be overwritten while lxc-attach is executing. To overcome this, the attacker can instead open /proc/self/exe using the O_PATH flag to get a file descriptor <fd>, then reopen the binary with O_WRONLY through /proc/self/fd/<fd> and try to write to it in a busy loop from a newly forked subprocess. Eventually, the write succeeds when the parent lxc-attach process exits. After this, the lxc-attach binary on the host is compromised and can be used to attack other containers or the host itself. The rewriting logic can be carried out by the malicious payload injected into the target binary in step 1.
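To make the role of /proc/self/exe in the steps above more concrete, here is a small, harmless Python sketch that only reads the link and reports which binary the kernel considers to be running. The function name running_binary and its proc_link parameter are our own illustrative choices, not part of any container runtime, and the sketch assumes a Linux host with /proc mounted.

```python
import os
import sys


def running_binary(proc_link="/proc/self/exe"):
    """Return the path the kernel reports for this process's executable.

    Returns None when the link cannot be read (for example on a
    non-Linux system, or when /proc is not mounted).
    """
    try:
        # /proc/self/exe is a kernel-maintained symbolic link; reading it
        # does not modify anything on the system.
        return os.readlink(proc_link)
    except OSError:
        return None


if __name__ == "__main__":
    # For a Python script, the kernel reports the interpreter binary,
    # not the script file, because the interpreter is what was executed.
    print("kernel view of this process:", running_binary())
    print("interpreter according to Python:", sys.executable)
```

The two printed paths may differ if sys.executable is itself a symbolic link. The point is only that any process can ask the kernel which binary it is running as, which is what lets a payload interpreted by the container runtime locate the runtime binary on the host.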
Therefore, three major conditions must hold for the attack to succeed:
- The attacker must have or gain control of the contents of the image in order to replace the target binary inside the container. This is achievable if the attacker controls the container image or previously had write access to the container.
- The attacker must be able to invoke the container runtime on the host system through some external channel. This is the case if the host system exposes an API layer (e.g., kubelet API server) that allows users to invoke the container runtime binary indirectly, for example an API that allows a remote user to launch a container with a custom image, or to attach to a running container using lxc-attach or docker exec.
- The attacker must have permission to overwrite the content of the host’s container runtime binary from the container. This is possible if the container is running as a privileged user on the host system, but impossible if it is running as an unprivileged user.
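As a reading aid, the three conditions above can be written as a simple conjunction: the attack is only feasible when all of them hold at once. The Python sketch below is purely illustrative. The Deployment class, its field names, and is_exposed are hypothetical names we introduce here, not part of any Databricks or container runtime API, and the example values are assumptions chosen for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical summary of a container deployment's exposure."""
    # Condition 1: attacker controls the image or had write access
    # to the container's contents.
    attacker_controls_container_content: bool
    # Condition 2: attacker can cause the runtime binary to be invoked
    # on the host (launch a custom image, or attach/exec).
    attacker_can_invoke_runtime: bool
    # Condition 3: the container runs as a privileged user on the host,
    # so it could overwrite the runtime binary.
    container_is_privileged: bool


def is_exposed(d: Deployment) -> bool:
    """Return True only when all three attack conditions hold together."""
    return (
        d.attacker_controls_container_content
        and d.attacker_can_invoke_runtime
        and d.container_is_privileged
    )


if __name__ == "__main__":
    # Assumed example: a host that lets remote users run custom images
    # in privileged containers.
    open_host = Deployment(True, True, True)
    # Assumed example: trusted images only and no attach API; the third
    # field is set to True as a worst-case assumption.
    trusted_only = Deployment(False, False, True)
    print(is_exposed(open_host))
    print(is_exposed(trusted_only))
```

Because the conditions are combined with "and", removing any single one is enough to block this particular attack path.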
Databricks only exposes an API to launch containers with trusted Databricks Runtime images released by our engineering team, and these containers are not subject to modification by users prior to being attached or created. Because users cannot supply or alter the image contents, the first condition is not met, and the standard Databricks architecture is unaffected. Additionally, Databricks workspace users access containers through an RPC server running inside the container, and so cannot attach to existing containers using the low-level container runtime binaries.