Last updated on: August 23, 2024
In the previous blog, we discussed how to securely access Azure Data Services from Azure Databricks using Virtual Network Service Endpoints or Private Link.
Here’s a quick recap:
Service Principals: Use Azure AD service principals for secure authentication.
Managed Identity: Leverage managed identities for secure access without handling credentials.
Azure Key Vault: Store and manage secrets securely using Azure Key Vault.
VNet and Private Endpoints: Ensure secure networking with VNet injection and private links.
In this article, we walk through detailed steps on how to harden your Azure Databricks deployment from a network security perspective in order to prevent data exfiltration.
As per Wikipedia: Data exfiltration occurs when malware and/or a malicious actor carries out an unauthorized data transfer from a computer. It is also commonly called data extrusion or data exportation, and is considered a form of data theft. Since the year 2000, a number of data exfiltration efforts have severely damaged consumer confidence, corporate valuations, the intellectual property of businesses, and the national security of governments across the world. The problem assumes even more significance as enterprises store and process sensitive data (PII, PHI, or strategic confidential data) with public cloud services.
Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or processes the data in the service provider's network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of one of the fastest-growing Data & AI services on Azure. We've come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it's time that we share it out broadly.
There are three distinct flavors of Databricks workspace deployments from a network perspective.
Please note that no matter which option you choose, the virtual network used by Databricks resides in your Azure subscription. The rest of this article is built around option 3, i.e., deploying the workspace in a customer-managed virtual network with secure cluster connectivity and Private Link.
Azure Databricks supports two types of Private Link deployment, and you must choose one:
Standard deployment (recommended): For improved security, Databricks recommends using a separate private endpoint for your front-end connection, accessed from a separate transit VNet. You can implement both front-end and back-end Private Link connections, or just the back-end connection. Use a separate VNet to encapsulate user access, apart from the VNet you use for your compute resources in the Classic data plane, and create separate Private Link endpoints for back-end and front-end access. Follow the instructions in Enable Azure Private Link as a standard deployment.
Simplified deployment: Some organizations cannot use the standard deployment for various network policy reasons, such as disallowing more than one private endpoint or discouraging separate transit VNets. In this deployment, no separate VNet separates user access from the compute resources in the Classic data plane; instead, a transit subnet in the data plane VNet is used for user access, and there is only a single Private Link endpoint. Typically both front-end and back-end connectivity are configured. You can optionally configure only the front-end connection, but you cannot use only the back-end connection in this deployment type. Follow the instructions in Enable Azure Private Link as a simplified deployment.
We recommend a hub and spoke topology styled reference architecture. The hub virtual network houses the shared infrastructure required to connect to validated sources and optionally to an on-premises environment. And the spoke virtual networks peer with the hub, while housing isolated Azure Databricks workspaces for different business units or segregated teams.
Such a hub-and-spoke architecture allows creating multiple spoke VNets for different purposes and teams. It is also possible to implement isolation by creating separate subnets for different teams within a large contiguous virtual network. In such instances, you can set up multiple isolated Azure Databricks workspaces in their own subnet pairs, and deploy Azure Firewall in a sister subnet within the same virtual network.
High-level view:
Steps to deploy a secure Azure Databricks deployment:
Why do we need two subnets per workspace?
A workspace requires two subnets, popularly known as the "host" (a.k.a. "public") and "container" (a.k.a. "private") subnets. For each cluster node, the host subnet provides an IP address to the host (the Azure VM) and the container subnet provides an IP address to the container (the Databricks Runtime, a.k.a. DBR) that runs inside the VM.
Does the public or host subnet have public ips?
No. When you create a workspace using secure cluster connectivity (SCC), none of the Databricks subnets have public IP addresses; the host subnet's default name just happens to be public-subnet. SCC ensures that no network traffic from outside your network can reach the workspace compute instances (e.g., via SSH).
Is it possible to resize/change the subnet sizes after the deployment?
Yes, it is possible to resize the subnets after the deployment by submitting a support case with Azure support. It is not possible to change the virtual network or the subnet names.
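Because resizing after deployment requires a support case, it pays to size the subnets for your largest expected clusters up front. As a rough sketch (plain Python, stdlib `ipaddress`; the five reserved addresses per subnet reflect standard Azure subnet behavior), cluster capacity is bounded by the smaller of the two subnets, since each node consumes one IP in each:

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves 5 addresses in every subnet


def max_cluster_nodes(host_cidr: str, container_cidr: str) -> int:
    """Each cluster node consumes one IP in the host subnet and one in the
    container subnet, so capacity is bounded by the smaller subnet."""
    def usable(cidr: str) -> int:
        return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_IPS

    return min(usable(host_cidr), usable(container_cidr))


print(max_cluster_nodes("10.1.0.0/26", "10.1.0.64/26"))  # /26 -> 64 - 5 = 59 nodes
print(max_cluster_nodes("10.1.0.0/24", "10.1.1.0/24"))   # /24 -> 256 - 5 = 251 nodes
```

The CIDR blocks shown are illustrative; substitute your own addressing plan and check it against the node counts of your largest anticipated clusters.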
| Item | Details |
|---|---|
| Virtual Network | Virtual network to deploy the Azure Databricks data plane (a.k.a. VNet injection). Make sure to choose the right CIDR blocks. |
| Subnets | Three subnets: Host (public), Container (private), and a private endpoint subnet (to hold private endpoints for storage, DBFS, and other Azure services that you may use) |
| Route Tables | Channel egress traffic from the Databricks subnets to a network appliance, the Internet, or on-prem data sources |
| Azure Firewall | Inspect any egress traffic and take actions according to allow/deny policies |
| Private DNS Zones | Provide reliable, secure DNS service to manage and resolve domain names in a virtual network (can be created automatically as part of the deployment if not available) |
| Azure Key Vault | Stores the CMKs for encrypting DBFS, managed disks, and managed services |
| Azure Databricks Access Connector | Required if enabling Unity Catalog; connects managed identities to an Azure Databricks account for the purpose of accessing data registered in Unity Catalog |
| List of Azure Databricks services to allow list on the firewall | Follow the public documentation and make a list of all the IPs and domain names relevant to your Databricks deployment |
The default deployment of Azure Databricks creates a new virtual network (with two subnets) in a resource group managed by Databricks. To make the necessary customizations for a secure deployment, the workspace data plane should be deployed in your own virtual network (a.k.a. a VNet-injected workspace) with NPIP. This deployment can be done using the Azure Portal, all-in-one ARM templates, or the Azure Databricks Terraform provider.
Create a virtual network in a resource group with three subnets (host/public, container/private, and pe). Note that the pe subnet is used for private endpoints, to ensure all application data is accessed securely over the Azure network backbone. The host (public) and container (private) subnet sizes should be determined based on your use cases before the workspace deployment: once the Databricks workspace is deployed, resizing the subnets requires an Azure support case, and the virtual network and subnet names cannot be changed.
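Before creating the VNet, it is worth sanity-checking the addressing plan. The sketch below (plain Python, stdlib `ipaddress`; the subnet names and CIDR blocks are illustrative assumptions, not prescribed values) verifies that all three subnets fit inside the VNet and do not overlap one another:

```python
import ipaddress
from itertools import combinations


def validate_plan(vnet_cidr: str, subnet_cidrs: dict) -> list:
    """Return a list of problems with a VNet/subnet addressing plan:
    subnets that fall outside the VNet, and subnets that overlap."""
    vnet = ipaddress.ip_network(vnet_cidr)
    nets = {name: ipaddress.ip_network(c) for name, c in subnet_cidrs.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vnet):
            problems.append(f"{name} ({net}) is outside the VNet {vnet}")
    for (a, na), (b, nb) in combinations(nets.items(), 2):
        if na.overlaps(nb):
            problems.append(f"{a} ({na}) overlaps {b} ({nb})")
    return problems


# Hypothetical plan: host/container sized for clusters, a smaller pe subnet.
plan = {"host": "10.2.0.0/24", "container": "10.2.1.0/24", "pe": "10.2.2.0/26"}
print(validate_plan("10.2.0.0/22", plan))  # [] -> plan is consistent
```

An empty result means the plan is internally consistent; any strings returned describe what to fix before deploying, while it is still cheap to change.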
Deploy Azure Databricks from Azure Portal
Create an Azure Databricks workspace with VNet injection and no public IPs (SCC) from the Azure Portal.
Click Review + Create. A few things to note:
Inbound Rules
Worker to Worker Inbound rule allows traffic between cluster instances.
Outbound Rules
Azure Databricks creates a default blob storage (a.k.a. root storage) during the deployment process, which is used for storing logs and telemetry. Even though public access is enabled on this storage, the Deny Assignment created on it prohibits any direct external access; it can be accessed only via the Databricks workspace. Azure Databricks deployments now support secure connection to the root blob storage (DBFS) through the creation of Private Endpoints (both dfs and blob), but enabling a private endpoint for DBFS does not turn off public access. Note that Private Endpoints for storage incur additional cost.
As a best practice, it is NOT recommended to store any application data in the root blob (DBFS) storage. Leverage a separate ADLS Gen2 storage account to store any application-specific data, accessed via Private Link (see Securely Accessing Azure Data Services).
We do not recommend setting up access to such data services through a network virtual appliance / firewall, as that has the potential to adversely impact the performance of big data workloads and the intermediate infrastructure.
NOTE: It is highly recommended to store application data in an external ADLS Gen2 storage account. Follow a similar setup to create Private Link endpoints for the external ADLS storage accounts to access and store data securely.
To configure such private endpoints for additional services, please refer to the relevant Azure documentation.
Azure Firewall is a scalable cloud native firewall that can act as the filtering device for any allowed public endpoints to be accessible from your Azure Databricks workspace.
Typically, firewalls are placed in the centralized hub VNet and peered with multiple spoke VNets; the spoke VNets egress all their traffic via the firewall.
Azure Firewall policies are the recommended approach to create rules for the Azure Firewall. The firewall policies are global resources that can be used across multiple Azure Firewall instances.
Create a network rule collection (IP address based) and an application rule collection (FQDN based). The example below shows a representative set of rules; for exact details, please refer to the complete list of control plane assets relevant to your deployment region.
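To build intuition for what the application (FQDN-based) rules do, here is a deliberately simplified model in Python (stdlib `fnmatch`). The allow list below is hypothetical: the real set of FQDNs must come from the Azure Databricks control-plane documentation for your region, and real Azure Firewall evaluation also considers protocols, ports, and rule priorities:

```python
import fnmatch

# Hypothetical allow list modeled on Azure Firewall application rules.
# Substitute the actual FQDNs for your region from the public documentation.
ALLOWED_FQDNS = [
    "*.azuredatabricks.net",        # workspace and control plane endpoints
    "*.blob.core.windows.net",      # artifact / log storage (narrow this in practice)
    "*.database.windows.net",       # metastore
]


def is_allowed(fqdn: str) -> bool:
    """Simplified model: an egress FQDN is allowed if it matches any
    wildcard pattern in the allow list."""
    return any(fnmatch.fnmatch(fqdn, pattern) for pattern in ALLOWED_FQDNS)


print(is_allowed("adb-1234567890123456.7.azuredatabricks.net"))  # True
print(is_allowed("evil-exfil-endpoint.example.com"))             # False
```

Anything not matching an allow rule is denied by the firewall's default behavior, which is precisely what blocks exfiltration to arbitrary external endpoints.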
Attach the firewall policy to the firewall.
At this point, the majority of the infrastructure setup is completed. Next we need to route traffic from Azure Databricks workspace subnets to Azure Firewall.
Create a route table and forward all traffic by adding a 0.0.0.0/0 route with next hop type Virtual appliance (the Azure Firewall's private IP).
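The reason the single 0.0.0.0/0 route suffices is longest-prefix matching: Azure picks the most specific route that covers a destination, so traffic matched by a more specific route (for example, intra-VNet traffic) bypasses the default route, while everything else is forced through the firewall. A minimal sketch of that selection logic (plain Python, stdlib `ipaddress`; the route entries are illustrative assumptions):

```python
import ipaddress

# Hypothetical route table. Azure selects the most specific (longest-prefix)
# matching route, so 0.0.0.0/0 catches everything not matched elsewhere.
ROUTES = {
    "10.0.0.0/8": "VnetLocal",         # assumption: private address space stays local
    "0.0.0.0/0":  "VirtualAppliance",  # default route -> Azure Firewall private IP
}


def next_hop(destination: str) -> str:
    """Return the next-hop type for a destination IP via longest-prefix match."""
    dest = ipaddress.ip_address(destination)
    matches = [ipaddress.ip_network(p) for p in ROUTES
               if dest in ipaddress.ip_network(p)]
    best = max(matches, key=lambda n: n.prefixlen)  # longest prefix wins
    return ROUTES[str(best)]


print(next_hop("10.1.2.3"))     # VnetLocal
print(next_hop("52.100.20.5"))  # VirtualAppliance (forced through the firewall)
```

In the real deployment, associate this route table with both Databricks workspace subnets so all their egress traffic is evaluated against it.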
Finally, the virtual networks azuredatabricks-spoke-vnet and hub-vnet need to be peered so that the route table configured earlier works properly. Follow the documentation to set up VNet peering between the hub and spoke networks.
The setup is now complete.
We are now at the last step: assign the workspace to Unity Catalog.
It's time to put everything to test now:
If the data access worked without any issues, you've accomplished the optimum secure deployment for Azure Databricks in your subscription. This was quite a bit of manual work, but that was more for a one-time showcase; in practical terms, you would want to automate such a setup using a combination of ARM templates, the Azure CLI, the Azure SDK, etc.:
Can I use Service Endpoints to access Azure data services from the workspace?
Yes. Service Endpoints provide secure and direct connectivity to Azure services owned and managed by customers (e.g., ADLS Gen2, Azure Key Vault, or Event Hubs) over an optimized route on the Azure backbone network. Service Endpoints can be used to restrict connectivity to external Azure resources to only your virtual network.
Can I apply service endpoint policies on the Databricks workspace subnets?
No. The subnets used by Databricks are locked using a network intent policy, which prevents service endpoint policy enforcement on the Databricks-managed storage services used by the artifacts and logs service, and on the Event Hub used by the health monitoring service. Azure network intent policies are an internal network construct to prevent customers from accidentally modifying the subnets used by Databricks.
Can I use a third-party network virtual appliance instead of Azure Firewall?
Yes, you could use a third-party NVA as long as the network traffic rules are configured as discussed in this article. Please note that we have tested this setup with Azure Firewall only, though some of our customers use other third-party appliances. It is ideal to deploy the appliance in the cloud rather than on-premises.
Can I deploy Azure Firewall in the same virtual network as the Azure Databricks workspace subnets?
Yes, you can. As per the Azure reference architecture, it is advisable to use a hub-spoke virtual network topology to plan better for the future. Should you choose to create the Azure Firewall subnet in the same virtual network as the Azure Databricks workspace subnets, you would not need to configure the virtual network peering discussed in Step 6 above.
Can I filter Azure Databricks control plane SCC relay IP traffic through Azure Firewall?
Yes, you can, but keep these points in mind:
Can I analyze accepted or blocked traffic by Azure Firewall?
We recommend using Azure Firewall Logs and Metrics for that requirement.
Can I upgrade an existing non-NPIP (managed VNet) Databricks deployment to an NPIP or Private Link-enabled workspace?
No, a managed-VNet Databricks deployment cannot be upgraded to a VNet-injected workspace. Databricks recommends creating a new VNet-injected workspace and migrating the workspace artifacts.
We discussed utilizing cloud-native security controls to implement data exfiltration protection for your Azure Databricks deployments, all of which can be automated to enable data teams at scale. Some other things that you may want to consider and implement as part of this project:
Please reach out to your Microsoft or Databricks account team for any questions.