by Bhavin Kukadia, Abhinav Garg and Michal Marusan
Azure Databricks is a Unified Data Analytics Platform that is part of the Microsoft Azure cloud. Built upon the foundations of Delta Lake, MLflow, Koalas and Apache Spark, Azure Databricks is a first-party service on Microsoft Azure that provides one-click setup, native integrations with other Azure services, an interactive workspace, and enterprise-grade security to power Data & AI use cases for customers ranging from small businesses to large global enterprises. The platform enables true collaboration between different data personas in any enterprise, such as Data Engineers, Data Scientists, Data Analysts and SecOps / Cloud Engineering.
In this blog, the first in a series of two, we'll provide an overview of the Azure Databricks architecture and show how customers can connect securely to Azure data service instances that they manage themselves.
Azure Databricks is a managed application on the Azure cloud. At a high level, the architecture consists of a control / management plane and a data plane. The control plane resides in a Microsoft-managed subscription and houses services such as the web application, cluster manager, and jobs service. In the default deployment, the data plane is a fully managed component in the customer's subscription that includes a VNET, an NSG and a root storage account known as DBFS.
The data plane can also be deployed in a customer-managed VNET, which allows the SecOps and Cloud Engineering teams to build the security and network architecture for the service according to their enterprise governance policies. This capability is called Bring Your Own VNET or VNET Injection. The picture shows a representative view of such a customer architecture.
Enterprise security is a core tenet of building software at both Databricks and Microsoft, and it is therefore treated as a first-class citizen in Azure Databricks. In the context of this blog, secure connectivity means ensuring that traffic from Azure Databricks to Azure data services remains on the Azure network backbone, with the inherent ability to whitelist Azure Databricks as an allowed source. As a security best practice, we recommend a couple of options that customers can use to establish such a data access mechanism to Azure data services such as Azure Blob Storage, Azure Data Lake Storage Gen2, Azure Synapse Data Warehouse, and Azure Cosmos DB. Read on for a discussion of Azure Private Link and Service Endpoints.
The most secure way to access Azure data services from Azure Databricks is by configuring Private Link. As per the Azure documentation, Private Link enables you to access Azure PaaS Services (for example, Azure Storage, Azure Cosmos DB, and SQL Database) and Azure hosted customer/partner services over a Private Endpoint in your virtual network. Traffic between your virtual network and the service traverses the Microsoft network backbone, eliminating exposure to the public Internet. You can also create your own Private Link Service in your virtual network (VNet) and deliver it privately to your customers. The setup and consumption experience using Azure Private Link is consistent across Azure PaaS, customer-owned, and shared partner services. For details, please refer to the Azure Private Link documentation.
See below on how Azure Databricks and Private Link could be used together.
Azure Databricks and Azure Data Service Private Endpoints in separate VNETs
Azure Databricks and Azure Data Service Private Endpoints in same VNET
Please consider the following before implementing the private endpoint:
One example of where Private Link fits is a customer who uses several Azure data services in production alongside Azure Databricks, such as Blob Storage, ADLS Gen2, and SQL DB. The business would like users to query the masked, aggregated data in ADLS Gen2, but wants to prevent them from reaching the unmasked confidential data in the other data sources. In that case, a private endpoint could be established only for the ADLS Gen2 service, using any of the sub-options discussed above.
This is how such an environment could be configured:
1 - Set up Private Link for ADLS Gen2
2 - Deploy Azure Databricks in your VNET
Please note that it’s possible to configure more than one Private Link per Azure Data service, which allows you to build an architecture that conforms to your enterprise governance needs.
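Once a private endpoint is in place, one quick sanity check is to confirm that the storage account's hostname resolves to a private IP address from inside the VNET (for example, from a notebook attached to a cluster). The following is a minimal sketch of such a check, assuming DNS for the private endpoint has been configured so that the account's public hostname resolves to the endpoint's private IP. The helper names are illustrative and are not part of any Azure or Databricks API.

```python
import ipaddress
import socket
from typing import List


def is_private_address(address: str) -> bool:
    """Return True if the address is not globally routable.

    Note: `is_private` covers the RFC 1918 ranges as well as loopback
    and other reserved, non-global ranges.
    """
    return ipaddress.ip_address(address).is_private


def resolve_ipv4_addresses(hostname: str, port: int = 443) -> List[str]:
    """Resolve a hostname to its distinct IPv4 addresses via the local resolver."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, proto=socket.IPPROTO_TCP
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolves_to_private_endpoint(hostname: str) -> bool:
    """Return True only if every resolved address is a private address."""
    addresses = resolve_ipv4_addresses(hostname)
    return bool(addresses) and all(is_private_address(a) for a in addresses)
```

Calling `resolves_to_private_endpoint("<storage-account>.dfs.core.windows.net")` with your own account name from a cluster in the VNET should return `True` if name resolution goes through the private endpoint. A `False` result suggests the hostname still resolves to a public address.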
As per Azure documentation, Virtual Network (VNET) service endpoints extend your virtual network private address space. The endpoints also extend the identity of your VNet to the Azure services over a direct connection. Endpoints allow you to secure your critical Azure service resources to only your virtual networks. Traffic from your VNet to the Azure service always remains on the Microsoft Azure network backbone.
Improved security for your Azure service resources
Private address spaces for different virtual networks can overlap with each other, so you can't use a private address alone to uniquely identify traffic that originates from a particular VNET. Once service endpoints are enabled for the subnets in your VNET, you can add a virtual network firewall rule to secure the Azure data services by extending your VNET identity to those resources. Such a configuration helps remove public access to those resources and allows traffic only from your VNET.
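To see why a private source IP alone cannot identify a VNET, consider the small sketch below. The VNET names and address ranges are hypothetical and chosen only to illustrate the overlap.

```python
import ipaddress

# Hypothetical address spaces for two unrelated VNETs that happen to overlap.
vnets = {
    "vnet-a": ipaddress.ip_network("10.0.0.0/16"),
    "vnet-b": ipaddress.ip_network("10.0.128.0/17"),
}


def candidate_vnets(source_ip: str) -> list:
    """Return the names of all VNETs whose address space contains source_ip."""
    ip = ipaddress.ip_address(source_ip)
    return [name for name, network in vnets.items() if ip in network]


# The same private IP is a valid address in both VNETs, so an IP-based
# firewall rule could not tell which VNET the traffic came from.
matches = candidate_vnets("10.0.200.15")
print(vnets["vnet-a"].overlaps(vnets["vnet-b"]))  # True
print(matches)  # ['vnet-a', 'vnet-b']
```

A virtual network firewall rule avoids this ambiguity because it identifies the subnet itself rather than an address range.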
Optimal routing for Azure data service traffic from your virtual network
Today, any routes on your VNET that are used to direct public network-headed traffic via your cloud/on-premises-based virtual appliances are also used for the Azure data service traffic. Service endpoints provide optimal routing for Azure traffic.
Keeping traffic on the Azure network backbone
Service endpoints always direct Azure data service traffic directly from your VNET to the resource on the Microsoft Azure network backbone. Keeping traffic on the Azure network backbone allows you to continue auditing and monitoring outbound Internet traffic from your virtual networks, through forced-tunneling, without impacting data service traffic. For more information about user-defined routes and forced-tunneling, see Azure virtual network traffic routing.
Simple to set up with no management overhead
You no longer need reserved, public IP addresses in your virtual networks to secure Azure data service resources through IP firewall. There are no Network Address Translation (NAT) or gateway devices required to set up the service endpoints. You can configure service endpoints through a simple setup for a subnet. There's no additional overhead to maintaining the endpoints.
Azure Service Endpoint with Azure Databricks
Please consider the following before implementing the service endpoints:
Consider the same example as mentioned above for Private Link, and how it could look with Service Endpoints. In this case, an Azure Storage Service Endpoint could be configured on the Azure Databricks subnets, and the same subnets could then be whitelisted in the ADLS Gen2 firewall rules.
This is how such an environment could be configured:
1 - Set up Service Endpoint for ADLS Gen2
2 - Deploy Azure Databricks in your VNET
3 - Configure IP firewall rules on ADLS Gen2
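The steps above can be sketched as the configuration fragments they produce. The snippet below is a rough illustration only: it builds Python dictionaries shaped like the `serviceEndpoints` property of a subnet and the `networkAcls` property of a storage account, as we understand the ARM resource schemas. It does not deploy anything, and the subscription ID, resource group, VNET and subnet names are all hypothetical placeholders. Please verify the property names against the current Azure template reference before relying on them.

```python
import json
from typing import Dict, List


def subnet_resource_id(
    subscription_id: str, resource_group: str, vnet_name: str, subnet_name: str
) -> str:
    """Build the ARM resource ID of a subnet."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet_name}"
        f"/subnets/{subnet_name}"
    )


def subnet_service_endpoints() -> Dict:
    """Subnet property fragment that enables the Azure Storage service endpoint."""
    return {"serviceEndpoints": [{"service": "Microsoft.Storage"}]}


def storage_network_acls(allowed_subnet_ids: List[str]) -> Dict:
    """Storage account firewall fragment: deny by default, allow listed subnets."""
    return {
        "defaultAction": "Deny",
        "virtualNetworkRules": [
            {"id": subnet_id, "action": "Allow"} for subnet_id in allowed_subnet_ids
        ],
    }


# Hypothetical names for the two subnets used by a VNET-injected workspace.
databricks_subnets = [
    subnet_resource_id(
        "00000000-0000-0000-0000-000000000000", "rg-data", "vnet-databricks", name
    )
    for name in ("databricks-public", "databricks-private")
]

acls = storage_network_acls(databricks_subnets)
print(json.dumps(subnet_service_endpoints(), indent=2))
print(json.dumps(acls, indent=2))
```

Setting the default action to deny and listing only the Azure Databricks subnets mirrors the whitelisting described above: traffic from those subnets is allowed, and everything else is rejected.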
We discussed a couple of options available to access Azure data services securely from your Azure Databricks environment. Based on your business specifics, you could either use Azure Private Link or Virtual Network Service Endpoints. Once the network connectivity approach is finalized, you could utilize secure auth approaches to connect to those resources:
In the next blog in this series, we'll dive deep into how to set up a buttoned-up, locked-down environment that prevents data exfiltration (in other words, how to implement a data loss prevention architecture). It will use a mix of the options discussed above together with Azure Firewall. Please reach out to your Microsoft or Databricks account teams with any questions.