The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. Databricks integrates with cloud storage and security services in your cloud account, and manages and deploys cloud infrastructure on your behalf.
The overarching goal of this article is to mitigate the following risks:
Databricks supports several GCP-native tools and services that help protect data in transit and at rest. One such service is VPC Service Controls, which provides a way to define security perimeters around Google Cloud resources. Databricks also supports network security controls, such as firewall rules based on network tags or secure tags. Firewall rules allow you to control inbound and outbound traffic to and from your Google Compute Engine (GCE) virtual machines.
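As a concrete illustration, a tag-based egress firewall rule can be expressed in Terraform. This is a minimal sketch, not part of the reference deployment: the network reference, rule name, and `databricks-dataplane` tag are placeholder assumptions.

```hcl
# Hypothetical example: low-priority rule that denies all egress from VMs
# carrying the "databricks-dataplane" network tag; higher-priority allow
# rules (not shown) would then explicitly permit required destinations.
resource "google_compute_firewall" "deny_all_egress" {
  name      = "databricks-deny-all-egress"
  network   = "dataplane-vpc" # placeholder VPC name
  direction = "EGRESS"
  priority  = 65000

  deny {
    protocol = "all"
  }

  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["databricks-dataplane"]
}
```

Because firewall rules are evaluated by priority, this deny-by-default rule only takes effect for traffic not matched by a more specific allow rule.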
Encryption is another important component of data protection. Databricks supports several encryption options, including customer-managed encryption keys, key rotation, and encryption at rest and in transit. Databricks-managed encryption keys are used by default and enabled out of the box. Customers can also bring their own encryption keys managed by Google Cloud Key Management Service (KMS).
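To show what a customer-managed key with automatic rotation looks like, here is a hedged Terraform sketch; the key ring name, key name, location, and 90-day rotation period are illustrative choices, not requirements.

```hcl
# Hypothetical example: a Cloud KMS key ring and a software-backed
# customer-managed encryption key (CMEK) with automatic 90-day rotation.
resource "google_kms_key_ring" "databricks" {
  name     = "databricks-keyring"
  location = "us-central1"
}

resource "google_kms_crypto_key" "cmek" {
  name            = "databricks-cmek"
  key_ring        = google_kms_key_ring.databricks.id
  rotation_period = "7776000s" # 90 days, expressed in seconds
}
```

The resulting key's resource ID (`projects/.../keyRings/.../cryptoKeys/...`) is what gets passed to Databricks when configuring CMEK.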
Before we begin, let's look at the Databricks deployment architecture:
Databricks is structured to enable secure cross-functional team collaboration while keeping a significant amount of backend services managed by Databricks so you can stay focused on your data science, data analytics, and data engineering tasks.
Databricks operates out of a control plane and a data plane.
The following diagram represents the flow of data for Databricks on Google Cloud:
Let's understand the communication path we want to secure. Databricks could be consumed by users and applications in numerous ways, as shown below:
A Databricks workspace deployment includes the following network paths to secure:
From an end-user perspective, paths 1 and 2 require ingress controls, while paths 3, 4, and 5 require egress controls.
In this article, our focus is securing egress traffic from your Databricks workloads and providing prescriptive guidance on the proposed deployment architecture. Along the way, we'll also share best practices for securing ingress (user/client into Databricks) traffic.
Before you begin, please ensure that you are familiar with these topics:
There are several ways to implement the proposed deployment architecture:
Irrespective of the approach you use, the resource creation flow would look like this:
This is a prerequisite step. How the required infrastructure is provisioned, e.g., using Terraform, the gcloud CLI, or the GCP Cloud Console, is out of the scope of this article. Here's a list of GCP resources required:
GCP Resource Type | Purpose | Details |
---|---|---|
Project | Create Databricks Workspace (ws) | Project requirements |
Service Account | Used with Terraform to create ws | Databricks Required Role and Permission. In addition to this you may also need additional permissions depending upon the GCP resources you are provisioning. |
VPC + Subnets | Three subnets per ws | Network requirements |
Private Google Access (PGA) | Keeps traffic between Databricks control plane VPC and Customers VPC private | Configure PGA |
DNS for PGA | Private DNS zone for private api's | DNS Setup |
Private Service Connect Endpoints | Makes Databricks control plane services available over private IP addresses. Private endpoints need to reside in their own, separate subnet. | Endpoint creation |
Encryption Key | Customer-managed Encryption key used with Databricks | Cloud KMS-based key, supports auto key rotation. Key could be "software" or "HSM" aka hardware-backed keys. |
Google Cloud Storage Account for Audit Log Delivery | Storage for Databricks audit log delivery | Configure log delivery |
Google Cloud Storage (GCS) Account for Unity Catalog | Root storage for Unity Catalog | Configure Unity Catalog storage account |
Add or update VPC SC policy | Add Databricks-specific ingress and egress rules | Ingress and egress rules in YAML, along with the gcloud command to create a perimeter. Databricks project numbers and PSC attachment URIs are available here. |
Add/Update Access Level using Access Context Manager | Add the Databricks regional control plane NAT IP to your access policy so that ingress traffic is only allowed from allowlisted IPs | List of Databricks regional control plane egress IPs available here |
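One row from the table above worth illustrating is the private DNS zone for Private Google Access. The sketch below is a hedged Terraform example using the zone name and DNS name from this deployment; the VPC reference is a placeholder.

```hcl
# Hypothetical example: a private Cloud DNS zone so that
# *.gcp.databricks.com resolves to private (PSC) endpoints from
# within the data plane VPC rather than to public IPs.
resource "google_dns_managed_zone" "databricks" {
  name       = "databricks"
  dns_name   = "gcp.databricks.com." # trailing dot required
  visibility = "private"

  private_visibility_config {
    networks {
      network_url = "projects/my-project/global/networks/dataplane-vpc" # placeholder
    }
  }
}
```

A records pointing the workspace and relay hostnames at the PSC endpoint IPs would then be added inside this zone.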
Variable | Details |
---|---|
google_service_account_email | [NAME]@[PROJECT].iam.gserviceaccount.com |
google_project_name | PROJECT where data plane will be created |
google_region | E.g. us-central1, supported regions |
databricks_account_id | Locate your account id |
databricks_account_console_url | https://accounts.gcp.databricks.com |
databricks_workspace_name | [ANY NAME] |
databricks_admin_user | Provide at least one user email address. This user will be made a workspace admin upon creation. This is a required field. |
google_shared_vpc_project | PROJECT where VPC used by dataplane is located. If you are not using Shared VPC then enter the same value as google_project_name |
google_vpc_id | VPC ID |
gke_node_subnet | NODE SUBNET name aka PRIMARY subnet |
gke_pod_subnet | POD SUBNET name aka SECONDARY subnet |
gke_service_subnet | SERVICE SUBNET name aka SECONDARY subnet |
gke_master_ip_range | GKE control plane ip address range. Needs to be /28 |
cmek_resource_id | projects/[PROJECT]/locations/[LOCATION]/keyRings/[KEYRING]/cryptoKeys/[KEY] |
google_pe_subnet | A dedicated subnet for private endpoints, recommended size /28. Please review network topology options available before proceeding. For this deployment we are using the "Host Databricks users (clients) and the Databricks dataplane on the same network" option. |
workspace_pe | Unique name e.g. frontend-pe |
relay_pe | Unique name e.g. backend-pe |
relay_service_attachment | List of regional service attachment URI's |
workspace_service_attachment | List of regional service attachment URI's |
private_zone_name | E.g. "databricks" |
dns_name | gcp.databricks.com. (the trailing dot is required) |
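Pulling the variables above together, a `terraform.tfvars` file for this deployment might look like the following. Every value here is a placeholder assumption; substitute your own project, subnet, key, and service attachment details.

```hcl
# Hypothetical terraform.tfvars — all values are illustrative placeholders.
google_service_account_email   = "deployer@my-project.iam.gserviceaccount.com"
google_project_name            = "my-project"
google_region                  = "us-central1"
databricks_account_id          = "00000000-0000-0000-0000-000000000000"
databricks_account_console_url = "https://accounts.gcp.databricks.com"
databricks_workspace_name      = "secure-ws"
databricks_admin_user          = "admin@example.com"
google_shared_vpc_project      = "my-project" # same as project if not using Shared VPC
google_vpc_id                  = "dataplane-vpc"
gke_node_subnet                = "node-subnet"    # primary
gke_pod_subnet                 = "pod-subnet"     # secondary
gke_service_subnet             = "service-subnet" # secondary
gke_master_ip_range            = "10.3.0.0/28"    # must be /28
cmek_resource_id               = "projects/my-project/locations/us-central1/keyRings/databricks-keyring/cryptoKeys/databricks-cmek"
google_pe_subnet               = "pe-subnet"      # dedicated /28 for private endpoints
workspace_pe                   = "frontend-pe"
relay_pe                       = "backend-pe"
relay_service_attachment       = "<REGIONAL-RELAY-SERVICE-ATTACHMENT-URI>"
workspace_service_attachment   = "<REGIONAL-WORKSPACE-SERVICE-ATTACHMENT-URI>"
private_zone_name              = "databricks"
dns_name                       = "gcp.databricks.com."
```

The two service attachment URIs are region-specific and must be copied from the Databricks documentation for your workspace region.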
If you do not want to use the IP access list and would like to completely lock down workspace access (UI and APIs) outside of your corporate network, then you will need to:
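For comparison, if you do opt for the IP access list approach instead of a full lockdown, it can be managed with the Databricks Terraform provider. This is a hedged sketch: the resource label and CIDR range are placeholders.

```hcl
# Hypothetical example: enable workspace IP access lists, then allow
# only traffic originating from a corporate CIDR range.
resource "databricks_workspace_conf" "this" {
  custom_config = {
    "enableIpAccessLists" = true
  }
}

resource "databricks_ip_access_list" "corp" {
  label        = "corp-vpn"        # placeholder label
  list_type    = "ALLOW"
  ip_addresses = ["203.0.113.0/24"] # placeholder corporate CIDR
  depends_on   = [databricks_workspace_conf.this]
}
```

Note that an ALLOW list implicitly blocks every address not on it, so include the egress IPs of any automation that must reach the workspace APIs.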
Upon successful deployment, the Terraform output would look like this:
```
backend_end_psc_status = "Backend psc status: ACCEPTED"
front_end_psc_status = "Frontend psc status: ACCEPTED"
workspace_id = "workspace id: <UNIQUE-ID.N>"
ingress_firewall_enabled = "true"
ingress_firewall_ip_allowed = tolist([
  "xx.xx.xx.xx",
  "xx.xx.xx.xx/xx",
])
service_account = "Default SA attached to GKE nodes: databricks@<PROJECT>.iam.gserviceaccount.com"
workspace_url = "https://<UNIQUE-ID.N>.gcp.databricks.com"
```
We discussed utilizing cloud-native security controls to implement data exfiltration protection for your Databricks on GCP deployments, all of which can be automated to enable data teams at scale. Some other things that you may want to consider and implement as part of this project are: