Scanning for Arbitrary Code in Databricks Workspace With Improved Search and Audit Logs

Use the improved workspace search and audit logs to scan for arbitrary strings, such as the name of a library with a known Common Vulnerabilities and Exposures (CVE) entry

How can we tell whether our users are using a compromised library?
How do we know whether our users are using that API?

These are the types of questions we regularly receive from our customers.

Given the recent surge in reports of vulnerable libraries such as Python ctx and the PHPass hijack, it is understandable that customers need to be able to perform a time-sensitive investigation as soon as these issues are disclosed. They need to be able to confirm whether they are using any of the vulnerable libraries and to check whether any of the malicious indicators exist across their estate. Whenever a potential security incident like this comes up, the Databricks Incident Response team of course performs an investigation into our product and our internal systems, but it is the customer's responsibility to ensure that they are not using the impacted libraries within their own codebase, either by referencing the affected version directly or by using libraries that transitively depend on it. In these types of scenarios, Databricks typically recommends that customers evaluate whether their code utilizes the impacted library in any way. Now you can search your workspace for any string to help answer that question.

The goal of this blog is to inform you of a new and improved workspace search feature and audit log capabilities that you can use to scan notebooks, libraries, folders, files, and repos by name and also search for any arbitrary string within a notebook, such as the library used in the latest supply chain compromise. But you can also search for anything else! This new search feature will help you answer security questions about compromised libraries more quickly and easily, and get ahead of the attackers.

Background

To make our customers' lives easier, Databricks automatically incorporates many commonly used libraries into the Databricks Runtime (DBR). To see which libraries are included, please refer to the System Environment subsection of the Databricks Runtime release notes for the relevant DBR version. Databricks is responsible for keeping these libraries up to date, so all our customers need to do is regularly restart their clusters to take advantage of them. At Databricks, we take application security very seriously. Check out our Security and Trust Center for more information about this.

However, as a general purpose data analytics platform, Databricks enables customers to install whatever publicly or privately available Python, Java, Scala, or R libraries they need in order to fulfill their use case. Therefore, if one such library is compromised, our customers need to be able to look at their own codebase to validate whether there is any impact. Databricks recommends that customers evaluate whether their code utilizes potentially impacted libraries on a regular basis.
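For Python libraries, one way to check for transitive exposure is to walk the declared dependencies of the packages installed in an environment. The following is a minimal sketch, not a Databricks feature: the helper names (`find_dependents`, `installed_requirements`) are hypothetical, it relies only on the standard library's `importlib.metadata`, and it assumes it is run in the same Python environment as the code being investigated (for example, in a notebook attached to the cluster).

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a package name so comparisons ignore case and -, _, . differences."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Extract the bare package name from a requirement string like 'ctx (>=0.1)'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalize(match.group(1)) if match else ""


def find_dependents(target: str, requirements_by_package: dict[str, list[str]]) -> list[str]:
    """Return packages that depend on `target`, directly or transitively."""
    target = normalize(target)
    # Reverse graph: dependency name -> packages that declare it.
    reverse: dict[str, set[str]] = {}
    for package, requirements in requirements_by_package.items():
        for requirement in requirements:
            reverse.setdefault(requirement_name(requirement), set()).add(normalize(package))
    # Walk upwards from the target through everything that requires it.
    seen: set[str] = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for dependent in reverse.get(current, ()):
            if dependent != target and dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return sorted(seen)


def installed_requirements() -> dict[str, list[str]]:
    """Collect the declared requirements of every distribution in this environment."""
    result: dict[str, list[str]] = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            result[name] = list(dist.requires or [])
    return result
```

Calling `find_dependents("ctx", installed_requirements())` lists the installed packages that declare a dependency path to `ctx`. The sketch ignores version specifiers and environment markers, so it may over-report; treat a hit as a prompt for closer inspection, not as proof of impact.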

Searching for arbitrary code in Databricks Workspace


Figure 1: In this example, "mlflow" (not a compromised library) is searched for by name across notebooks, libraries, folders, files, and repos, and also within notebook content, with a preview of the matching content shown.

We invite you to log in to your own Databricks account and try running some searches in your workspace using the improved workspace search. Please see Search workspace for an object in our docs for more details.

To search the workspace for an object, click Search in the sidebar. The Search dialog appears.

New and improved workspace search

To search for a text string, type it into the search field and press Enter. The system searches the names of all notebooks, folders, files, libraries, and Repos in the workspace that you have access to. As an admin, you should be able to search the objects in the workspace. It also searches notebook commands, but not text in non-notebook files.
You can also search for items by type (file, folder, notebook, library, or repo). A text string is not required. When you press Enter, workspace objects that match the search criteria appear in the dialog. Click a name in the list to open that item in the workspace.


Figure 2: In this example, "ctx" (a library known to be compromised) is searched for by name across notebooks, libraries, folders, files, and repos, and also within notebook content, with a preview of the matching content shown. We further filtered the notebook results by a specific user to narrow the search.

Note:
The search behavior described in this blog is not supported on workspaces that use customer-managed keys for encryption. In those workspaces, you can use this notebook utility to assist in scanning a Databricks workspace for arbitrary strings. Please reach out to us if you need further assistance with the notebook.
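To illustrate the general idea of scanning for an arbitrary string outside the search dialog, here is a minimal sketch. It is not the notebook utility mentioned above. It assumes that notebooks have already been exported as source files (for example `.py`, `.sql`, `.scala`, or `.r`) into a local directory, and the function name `scan_for_string` is hypothetical.

```python
from pathlib import Path


def scan_for_string(root, needle, suffixes=(".py", ".sql", ".scala", ".r")):
    """Yield (path, line_number, line) for each line under `root` containing `needle`.

    Matching is case-insensitive and limited to files with the given suffixes.
    """
    needle_lower = needle.lower()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Replace undecodable bytes instead of failing on a single bad file.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle_lower in line.lower():
                yield str(path), line_number, line.strip()
```

For example, `list(scan_for_string("exported_notebooks", "import ctx"))` returns one tuple per matching line, where `exported_notebooks` is a placeholder for wherever the export was written.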

Ongoing detection with verbose audit logging

Security investigations into zero-day exploits are rarely straightforward; sometimes they can run on for several months. During this time, security teams may want to couple point-in-time searches with ongoing monitoring and alerting, to ensure that a vulnerable library isn't imported the day after they've confirmed it doesn't feature in their code.

Databricks customers can now leverage verbose audit logging of all notebook commands run during interactive development (see the docs for AWS, Azure). If they have set up audit log delivery and processing in the way described by our recent blog on this topic, they could use a Databricks SQL query like the one below to search notebook commands for strings like "import ctx":

SELECT
  timestamp,
  workspaceId,
  sourceIPAddress,
  email,
  requestParams.commandText,
  requestParams.status,
  requestParams.executionTime,
  requestParams.notebookId,
  result,
  errorMessage
FROM
  audit_logs.gold_workspace_notebook
WHERE actionName = "runCommand"
  AND contains(requestParams.commandText, {{query_string}})
ORDER BY timestamp DESC
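A plain substring match on "import ctx" would miss forms such as "from ctx import ..." or "%pip install ctx", and would also match unrelated names like "import ctxlib". If the command text is post-processed in Python, a stricter check could look like the following sketch. The function names are hypothetical, and the pattern covers only a few common forms, so it should be read as a starting point and not as a complete detector.

```python
import re


def build_library_pattern(library: str) -> re.Pattern:
    """Compile a pattern matching common ways of importing or installing `library`."""
    name = re.escape(library)
    return re.compile(
        rf"""
        ^\s*import\s+(?:[\w.]+\s*,\s*)*{name}(?!\w)              # import ctx / import os, ctx
        | ^\s*from\s+{name}(?:\.[\w.]+)?\s+import\b              # from ctx import Ctx
        | ^\s*[%!]?pip\s+install\b.*(?<![\w-]){name}(?![\w-])    # %pip install ctx
        """,
        re.MULTILINE | re.VERBOSE,
    )


def command_uses_library(command_text: str, library: str) -> bool:
    """Return True if any line of `command_text` imports or installs `library`."""
    return build_library_pattern(library).search(command_text) is not None
```

For example, `command_uses_library("from ctx import Ctx", "ctx")` is true, while `command_uses_library("import ctxlib", "ctx")` is false. Aliased multi-imports such as `import os as o, ctx` and dynamic imports are not covered.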

But that's still ad hoc querying, right? True, but with some simple modifications, this query could easily be converted into a Databricks SQL alert that is scheduled to run at regular intervals and sends an email notification if a specific library has been used at least once (count of events is > 0) in the last day:

SELECT
  date,
  workspaceId,
  sourceIPAddress,
  email,
  requestParams.commandText,
  count(*) AS total
FROM
  audit_logs.gold_workspace_notebook
WHERE actionName = "runCommand"
  AND contains(requestParams.commandText, "import ctx")
  AND date > current_date - 1
GROUP BY 1, 2, 3, 4, 5
ORDER BY date DESC
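The alert condition itself is simple: count matching events in a recent window and fire when the count is greater than zero. For teams that process audit events outside Databricks SQL, the same logic can be sketched in Python. The flattened event shape used here (keys `actionName`, `timestamp`, `commandText`, `email`) is an assumption made for illustration and is not the raw audit log schema. The sketch also uses a rolling 24-hour window, whereas the query above filters on the `date` column.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_recent_matches(events, needle, now=None, window=timedelta(days=1)):
    """Count runCommand events per user whose command text contains `needle`.

    Only events whose timestamp falls within `window` before `now` are counted.
    """
    now = now or datetime.now(timezone.utc)
    counts = Counter()
    for event in events:
        if event.get("actionName") != "runCommand":
            continue
        if now - event["timestamp"] > window:
            continue  # Older than the alerting window.
        if needle in event.get("commandText", ""):
            counts[event.get("email", "unknown")] += 1
    return counts


def should_alert(counts) -> bool:
    """Fire when at least one matching event was seen (count of events > 0)."""
    return sum(counts.values()) > 0
```

The per-user counts returned by `count_recent_matches` give an investigator a first indication of who ran the matching commands, much like the grouped rows returned by the query.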

This could be coupled with a custom alert template like the following to give security teams enough information to investigate whether the acceptable use policy has been violated:

Alert "{{ALERT_NAME}}" changed status to {{ALERT_STATUS}}

There have been the following unexpected events in the last day:

{{QUERY_RESULT_ROWS}}

Check out our documentation for instructions on how to configure alerts (AWS, Azure), as well as for adding additional alert destinations like Slack or PagerDuty (AWS, Azure).

Conclusion

In this blog post, you learned how to use the improved workspace search to find arbitrary code in a Databricks workspace, and how to leverage audit logs to monitor and alert on the use of vulnerable libraries. You also saw an example of how to hunt for signs of a compromised library. Stay tuned for more search capabilities in the months to come.

We look forward to your questions and suggestions. You can reach us at: [email protected]. Also, if you are curious about how Databricks approaches security, please review our Security & Trust Center.
