Databricks on AWS relies on custom Amazon Machine Images (AMIs) deployed as EC2 instances in the customer’s account. These EC2 instances provide the elastic compute for Databricks clusters. In the Databricks architecture, customer code runs in low-privileged containers rather than on the host itself. This model is an important security control, especially on our user isolation clusters, where multiple users with different entitlements share compute resources. As part of our Enhanced Security Monitoring feature, we provide AMIs with advanced hardening and pre-installed security agents that customers can use to detect and alert on suspicious activity. In this blog, we cover some useful queries for detecting potentially malicious activity based on the included Capsule8 alerts.

Using Databricks Workspace Audit Logs

First, customers need to have the Enhanced Security Monitoring (ESM) feature enabled. This can be done by contacting your Databricks representative, and it’s also automatically included in our compliance security profile for HIPAA, PCI-DSS, and FedRAMP. Once ESM is enabled, you should ensure that you’ve enabled audit log delivery and are ingesting the logs into Databricks so they can be queried. We have example Delta Live Tables pipelines that make this very simple. Once the logs are ingested as Delta tables, we can efficiently query them via either DBSQL or notebooks.
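
For example, once the audit logs are available as a Delta table, a quick notebook query can surface any Capsule8 events. Below is a minimal sketch: the `audit_logs` table name is an assumption based on our example pipelines, and Capsule8 events are assumed to arrive under the `capsule8-alerts-dataplane` service name.

# List recent Capsule8 events from the ingested audit logs.
# The table name and serviceName value are assumptions; adjust them to
# match your ingestion pipeline and delivered log schema.
capsule8_events = (
    spark.table("audit_logs")
    .where("serviceName = 'capsule8-alerts-dataplane'")
    .select("date", "workspaceId", "actionName", "requestParams")
    .orderBy("date", ascending=False)
)
display(capsule8_events)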

Understanding Capsule8 Alerts

Databricks has enabled specific Capsule8 detections on the ESM AMIs which are documented on our site. For this example, we will focus on a subset of the overall alerts, but customers should work with their security teams to prioritize which detections are most significant to their environment. In this case, we are focusing primarily on events such as potential container escapes, kernel exploits, and suspicious changes to the host OS that could impact the security and stability of the node. Because of the nature of Databricks clusters, user code does not have access to the host OS or base AMI, so alerts such as these could be good indicators of suspicious activity on the host or malicious users on the platform.

The four main categories of events we’ll be focusing on are:

* Container Escape
Given that user code runs in low-privileged containers, a container escape would be a significant event that could compromise the security of a cluster. In particular, on user isolation or Table ACL clusters, a container escape could lead to a data leak or other exposure.

* Kernel-Related Events
Again, user code does not have privileges on the host OS, and certainly not the permissions to load kernel modules. Kernel-related events like these could point to something malicious on the host, or to follow-on activity from a container escape.

* Host Security Changes
Changes to host security configurations such as AppArmor, boot files, or the certificate store would be unusual and should be investigated.

* Other Suspicious Activity
Once an instance is active and assigned to a cluster, there should not be any activity on the host such as new interactive shells, new files being executed in containers, or privileged containers being launched.

Monitoring Capsule8 Alerts

We’ll show how to monitor for Capsule8 alerts using Delta Live Tables, but since workspace audit logs are delivered as standard JSON files, customers can of course use any log monitoring tool. In this case, we apply detections as workspace audit logs are ingested via a Delta Live Tables pipeline, populating an alerts table based on simple SQL filter expressions for the detections above.

import dlt
from pyspark.sql.functions import input_file_name

detections = {
    "container-escape": "actionName in ('Container Escape via Kernel Exploitation', 'Userland Container Escape', 'New File Executed in Container', 'Privileged Container Launched')",
    "host-security": "actionName in ('Processor-Level Protections Disabled', 'AppArmor Disabled In Kernel', 'AppArmor Profile Modified', 'Boot Files Modified', 'Root Certificate Store Modified')",
    "kernel-exploit": "actionName in ('BPF Program Executed', 'Kernel Module Loaded', 'Kernel Exploit')",
    "suspicious-activity": "actionName in ('New File Executed in Container', 'Suspicious Interactive Shell', 'User Command Logging Evasion', 'Privileged Container Launched')"
}

# invert the detection logic for DLT expectations: an expectation should
# pass (i.e., no alert) when the detection condition does NOT match
detection_expectations = {key: f"not({value})" for key, value in detections.items()}

@dlt.table(
  name="workspace_audit_logs",
  partition_cols=["date", "workspaceId", "serviceName"],
  table_properties={
    "pipelines.autoOptimize.managed": "true",
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.autoOptimize.autoCompact": "true"
  }
)
@dlt.expect("clean_schema", "_rescued_data is null")
@dlt.expect_all(detection_expectations)
def workspace_audit_logs_ingest():
  # incrementally ingest newly delivered JSON audit logs with Auto Loader;
  # `workspace_logs_ingest_path` should point at the audit log delivery location
  return (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.inferColumnTypes", "true")
          .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
          .load(workspace_logs_ingest_path)
          .withColumn("filename", input_file_name()))

@dlt.table(
    name="alerts",
    table_properties={
        "pipelines.autoOptimize.managed": "true",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
      }
)
def alerts():
    logs = dlt.read_stream("workspace_audit_logs")
    # tag each record with the names of any matching detections
    alerts = compute_alerts(logs, detections)

    return alerts.filter("size(alerts) > 0") # only return records with alerts
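
The `compute_alerts` helper isn’t shown here; a minimal sketch, assuming it simply tags each record with an `alerts` array containing the names of any matching detections, might look like this:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def compute_alerts(logs: DataFrame, detections: dict) -> DataFrame:
    # For each detection, emit its name when its filter expression matches
    # (when() without otherwise() yields null for non-matches), then keep
    # only the non-null names in an `alerts` array column.
    matches = [F.when(F.expr(cond), F.lit(name)) for name, cond in detections.items()]
    return logs.withColumn(
        "alerts", F.filter(F.array(*matches), lambda a: a.isNotNull())
    )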

The detections are also applied as Delta Live Tables expectations. This gives us a convenient indicator in the pipeline UI showing whether any records matched our detection conditions. For proactive alerting, we can use DBSQL alerts to send an email or Slack message, or even call an arbitrary webhook, whenever a new detection is triggered.
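
As a hedged example, the query backing such a DBSQL alert could count recent detections per alert type. The table name, `date` filter, and one-day window below are illustrative:

# Count detections per alert type over the last day; a DBSQL alert can be
# configured to fire when any count is greater than zero. Adjust the table
# name and window to your deployment.
recent_alerts = spark.sql("""
    SELECT alert, count(*) AS events
    FROM alerts
    LATERAL VIEW explode(alerts) exploded AS alert
    WHERE date >= date_sub(current_date(), 1)
    GROUP BY alert
""")
display(recent_alerts)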

DLT pipeline for ingesting audit logs and alerting on Capsule8 events

The Capsule8 events include the AWS instance ID of the host that triggered the alert. We can use this instance ID to correlate with other logs, such as CloudTrail or VPC flow logs. In addition to AWS logs, our new verbose audit logs can also be used by analysts to review any notebook commands executed in a workspace as part of an investigation into Capsule8 alerts or other incidents.
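
For example, to gather the instance IDs behind recent alerts as a starting point for correlation, a hedged sketch might look like the following. The `requestParams.instance_id` field name is an assumption; check your ingested schema for where the instance ID actually lands.

# Collect the distinct EC2 instance IDs behind recent alerts so analysts can
# pivot to CloudTrail or VPC flow logs. `requestParams.instance_id` is an
# assumed field name; verify it against your ingested audit log schema.
recent_instance_ids = (
    spark.table("alerts")
    .where("date >= date_sub(current_date(), 7)")
    .selectExpr("requestParams.instance_id AS instance_id")
    .distinct()
)
display(recent_instance_ids)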

Conclusion

With Enhanced Security Monitoring, Databricks customers gain additional visibility into the security of the infrastructure supporting their deployment. Delta Live Tables provides a reliable and scalable way to ingest and process security log data into a cyber lakehouse. With a few simple queries, we can easily alert on and investigate any potentially suspicious activity.

To enable Enhanced Security Monitoring, please contact your Databricks representative.
