Azure Databricks is a Unified Data Analytics Platform that is part of the Microsoft Azure cloud. Built upon the foundations of Delta Lake, MLflow, Koalas and Apache Spark, Azure Databricks is a first-party service on the Microsoft Azure cloud that provides one-click setup, native integrations with other Azure services, an interactive workspace, and enterprise-grade security to power Data & AI use cases for small to large global customers. The platform enables true collaboration between different data personas in any enterprise, like Data Engineers, Data Scientists, Data Analysts and SecOps / Cloud Engineering.
In this blog, the first in a series of two, we'll provide an overview of the Azure Databricks architecture and show how customers can connect to their own managed instances of Azure data services in a secure manner.
Azure Databricks Architecture Overview
Azure Databricks is a managed application on the Azure cloud. At a high level, the architecture consists of a control / management plane and a data plane. The control plane resides in a Microsoft-managed subscription and houses services such as the web application, cluster manager, jobs service etc. In the default deployment, the data plane is a fully managed component in the customer's subscription that includes a VNET, an NSG and a root storage account known as DBFS.
The data plane could also be deployed in a customer-managed VNET, to allow the SecOps and Cloud Engineering teams to build the security & network architecture for the service as per their enterprise governance policies. This capability is called Bring Your Own VNET or VNET Injection. The picture shows a representative view of such a customer architecture.
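When planning a customer-managed VNET, it can help to sanity-check the address plan before deploying anything. The sketch below assumes the workspace uses two dedicated subnets (called host and container here), which this post does not spell out, and the function name and CIDR ranges are hypothetical. It only validates the address arithmetic with Python's standard library and does not call any Azure API.

```python
import ipaddress


def validate_databricks_subnets(vnet_cidr, host_cidr, container_cidr):
    """Return a list of problems found in a VNET injection address plan.

    Assumption: the workspace needs two dedicated, non-overlapping subnets
    that both sit inside the customer-managed VNET.
    """
    vnet = ipaddress.ip_network(vnet_cidr)
    host = ipaddress.ip_network(host_cidr)
    container = ipaddress.ip_network(container_cidr)

    problems = []
    for name, subnet in (("host", host), ("container", container)):
        # subnet_of() is True only when the subnet lies entirely inside the VNET
        if not subnet.subnet_of(vnet):
            problems.append(f"{name} subnet {subnet} is outside VNET {vnet}")
    if host.overlaps(container):
        problems.append("host and container subnets overlap")
    return problems


# Hypothetical address plan: one /16 VNET with two /24 subnets
print(validate_databricks_subnets("10.10.0.0/16", "10.10.1.0/24", "10.10.2.0/24"))
```

An empty list means the plan passed these two checks. It says nothing about subnet sizing or NSG rules, which still need to follow the service documentation.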
Secure connectivity to Azure Data Services
Enterprise Security is a core tenet of building software at both Databricks and Microsoft, and thus it’s considered a first-class citizen in Azure Databricks. In the context of this blog, secure connectivity refers to ensuring that traffic from Azure Databricks to Azure data services remains on the Azure network backbone, with the inherent ability to whitelist Azure Databricks as an allowed source. As a security best practice, we recommend a couple of options which customers could use to establish such a data access mechanism to Azure Data services like Azure Blob Storage, Azure Data Lake Storage Gen2, Azure Synapse Data Warehouse, Azure Cosmos DB etc. Please read further for a discussion on Azure Private Link and Service Endpoints.
Option 1: Azure Private Link
The most secure way to access Azure Data services from Azure Databricks is by configuring Private Link. As per the Azure documentation, Private Link enables you to access Azure PaaS Services (for example, Azure Storage, Azure Cosmos DB, and SQL Database) and Azure hosted customer/partner services over a Private Endpoint in your virtual network. Traffic between your virtual network and the service traverses over the Microsoft network backbone, eliminating exposure from the public Internet. You can also create your own Private Link Service in your virtual network (VNet) and deliver it privately to your customers. The setup and consumption experience using Azure Private Link is consistent across Azure PaaS, customer-owned, and shared partner services. For details, please refer to the Azure Private Link documentation.
See below on how Azure Databricks and Private Link could be used together.
Azure Databricks and Azure Data Service Private Endpoints in separate VNETs
Azure Databricks and Azure Data Service Private Endpoints in same VNET
Private Endpoint Considerations
Please consider the following before implementing the private endpoint:
- Provides protection against data exfiltration by default. In the case of Azure Databricks, this would apply once the customer whitelists access to specific services in the control plane.
- Keeps traffic on the Azure network backbone, i.e. the public network is not used for any data flow.
- Extends your private network address space to Azure Data services, i.e. the Azure data service effectively gets a private IP in one of your VNETs and could be treated as part of your larger private network.
- Connects privately to Azure Data services in other regions, i.e. a VNET in region A could connect to endpoints in region B via Private Link.
- Private Link is relatively more complex to set up compared to other secure access mechanisms.
- See the documentation for a detailed list of Private Link benefits and the service specific availability.
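One way to confirm that a private endpoint is actually in effect is to check, from inside the VNET, that the storage hostname resolves to a private IP rather than a public one. The following is a minimal sketch using only the Python standard library. The helper names and the account name are hypothetical, and the lookup only gives a meaningful answer when it runs on a machine that uses the VNET's DNS (for example, a cluster node).

```python
import ipaddress
import socket


def is_private_address(ip):
    """True if the address falls in a private (non-internet-routable) range."""
    return ipaddress.ip_address(ip).is_private


def resolves_privately(hostname):
    """Resolve a hostname and report whether it maps to a private IP.

    Needs working DNS; behind a private endpoint the service hostname is
    expected to resolve to an address from one of your VNETs.
    """
    return is_private_address(socket.gethostbyname(hostname))


# Pure address checks, no network access needed
print(is_private_address("10.1.2.4"))       # an address from a VNET range
print(is_private_address("52.239.152.10"))  # a public address

# On a cluster node, with a hypothetical storage account name:
# resolves_privately("mystorageaccount.dfs.core.windows.net")
```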
One example of where Private Link could be used is when a customer runs a few Azure Data services in production along with Azure Databricks, like Blob Storage, ADLS Gen2, SQL DB etc. The business would like the users to query the masked aggregated data from ADLS Gen2, but restrict them from making their way to the unmasked confidential data in other data sources. In that case, a private endpoint could be established only for the ADLS Gen2 service using any of the sub-options discussed above.
This is how such an environment could be configured:
1 - Set up Private Link for ADLS Gen2
2 - Deploy Azure Databricks in your VNET
Please note that it’s possible to configure more than one Private Link per Azure Data service, which allows you to build an architecture that conforms to your enterprise governance needs.
Option 2: Azure Virtual Network Service Endpoints
As per Azure documentation, Virtual Network (VNET) service endpoints extend your virtual network private address space. The endpoints also extend the identity of your VNet to the Azure services over a direct connection. Endpoints allow you to secure your critical Azure service resources to only your virtual networks. Traffic from your VNet to the Azure service always remains on the Microsoft Azure network backbone.
Service endpoints provide the following benefits (source):
Improved security for your Azure service resources
Private address space for different virtual networks can overlap with each other. You can't use overlapping network space to uniquely identify traffic that originates from a particular VNET. Once service endpoints are enabled for the subnets in your VNET, you can add a virtual network firewall rule to secure the Azure data services by extending your VNET identity to those resources. Such a configuration helps remove public access to those resources and allows traffic only from your VNET.
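To see why a private IP alone cannot identify the originating VNET, consider two VNETs that use the same address range. The sketch below is purely illustrative: the VNET names and CIDR ranges are made up, and the helper just lists every VNET whose range contains a given source address.

```python
import ipaddress


def candidate_vnets(source_ip, vnets):
    """Return the names of all VNETs whose address space contains source_ip."""
    ip = ipaddress.ip_address(source_ip)
    return [name for name, cidr in vnets.items()
            if ip in ipaddress.ip_network(cidr)]


# Two hypothetical VNETs that reuse the same private range
vnets = {"vnet-a": "10.0.0.0/16", "vnet-b": "10.0.0.0/16"}

matches = candidate_vnets("10.0.1.4", vnets)
print(matches)  # more than one match means the IP is ambiguous
```

Because both VNETs match, a rule keyed on the private IP would admit traffic from either one. A virtual network rule avoids this by identifying the subnet itself rather than its address range.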
Optimal routing for Azure data service traffic from your virtual network
Today, any routes on your VNET that are used to direct public network-headed traffic via your cloud/on-premises-based virtual appliances are also used for the Azure data service traffic. Service endpoints provide optimal routing for Azure traffic.
Keeping traffic on the Azure network backbone
Service endpoints always direct Azure data service traffic directly from your VNET to the resource on the Microsoft Azure network backbone. Keeping traffic on the Azure network backbone allows you to continue auditing and monitoring outbound Internet traffic from your virtual networks, through forced-tunneling, without impacting data service traffic. For more information about user-defined routes and forced-tunneling, see Azure virtual network traffic routing.
Simple to set up with no management overhead
You no longer need reserved, public IP addresses in your virtual networks to secure Azure data service resources through IP firewall. There are no Network Address Translation (NAT) or gateway devices required to set up the service endpoints. You can configure service endpoints through a simple setup for a subnet. There's no additional overhead to maintaining the endpoints.
Azure Service Endpoint with Azure Databricks
Azure Service Endpoint Considerations
Please consider the following before implementing the service endpoints:
- Does not provide protection against data exfiltration by default.
- Keeps traffic on the Azure network backbone, i.e. the public network is not used for any data flow.
- Does not extend your private network address space to Azure Data services.
- Cannot connect privately to Azure Data services in other regions (except for paired regions).
- See the documentation for a detailed list of Azure Service Endpoint benefits and limitations.
Let's take the same example as mentioned above for Private Link and see how it could look with Service Endpoints. In this case, an Azure Storage Service Endpoint could be configured on the Azure Databricks subnets, and the same subnets could then be whitelisted in the ADLS Gen2 firewall rules.
This is how such an environment could be configured:
1 - Set up Service Endpoint for ADLS Gen2
2 - Deploy Azure Databricks in your VNET
3 - Configure firewall rules on ADLS Gen2 to allow the Azure Databricks subnets
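The whitelisting step can be pictured as a small piece of configuration on the storage account. The sketch below builds a dictionary modeled on the networkAcls section of a storage account resource definition, as we understand it: deny by default, then allow specific subnets. All identifiers (subscription, resource group, VNET and subnet names) are placeholders, and the helper functions are hypothetical. Nothing here is applied to Azure. It only shows the shape of the rule set.

```python
import json


def subnet_resource_id(subscription_id, resource_group, vnet, subnet):
    """Build the Azure resource ID of a subnet."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}"
    )


def storage_network_acls(subnet_ids):
    """Deny all traffic by default and allow only the given subnets."""
    return {
        "defaultAction": "Deny",
        "bypass": "None",
        "virtualNetworkRules": [
            {"id": sid, "action": "Allow"} for sid in subnet_ids
        ],
        "ipRules": [],
    }


# Placeholder names for the two Azure Databricks subnets
subnets = [
    subnet_resource_id("<subscription-id>", "rg-data", "vnet-dbx", name)
    for name in ("dbx-host", "dbx-container")
]
print(json.dumps(storage_network_acls(subnets), indent=2))
```

The service endpoint for Azure Storage must already be enabled on each listed subnet (step 1) for such rules to take effect.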
Getting Started with Secure Azure Data Access
We discussed a couple of options available to access Azure data services securely from your Azure Databricks environment. Based on your business specifics, you could either use Azure Private Link or Virtual Network Service Endpoints. Once the network connectivity approach is finalized, you could utilize secure auth approaches to connect to those resources:
- Please refer to the Azure Databricks documentation for specific data sources.
- Consider using secrets to hide any credentials.
- When possible, access ADLS Gen2 using Azure AD credential passthrough.
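As an illustration of keeping credentials out of notebooks, the sketch below assembles the Spark configuration for accessing ADLS Gen2 with a service principal (OAuth), pulling the client secret through a caller-supplied lookup function. The function name, account, tenant, scope and key values are all hypothetical. On a cluster the lookup would typically be dbutils.secrets.get, and here a stand-in is used so the example runs anywhere. Treat it as a sketch of one possible auth approach, not the only one.

```python
def adls_oauth_conf(account, tenant_id, client_id, get_secret, scope, key):
    """Build Spark conf entries for service principal access to ADLS Gen2.

    get_secret is any callable taking (scope, key) and returning the secret,
    so the secret value never has to appear in notebook source.
    """
    suffix = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": get_secret(scope, key),
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def fake_get_secret(scope, key):
    """Stand-in for a real secret store lookup (for local runs only)."""
    return f"<secret {scope}/{key}>"


conf = adls_oauth_conf(
    "mystorageaccount", "<tenant-id>", "<client-id>",
    fake_get_secret, "my-scope", "sp-client-secret",
)
for name in conf:
    print(name)
```

In a notebook, each entry would then be applied with spark.conf.set(name, value) before reading from a path such as abfss://<container>@mystorageaccount.dfs.core.windows.net/<path>.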
In the next blog in this series, we’ll dive deep into how one could set up a buttoned-up, locked-down environment to prevent data exfiltration (in other words, implement a data loss prevention architecture). It will use a mix of the options discussed above and Azure Firewall. Please reach out to your Microsoft or Databricks account teams with any questions.