Databricks Unified Analytics Platform, built by the original creators of Apache Spark™, brings Data Engineers, Data Scientists and Business Analysts together with data on a single platform. It allows them to collaborate and create the next generation of innovative products and services. To create the analytics that power these next-gen products, Data Scientists and Engineers need access to various sources of data. Apart from data in cloud object storage such as S3, the data they need often resides in services such as databases or comes from streaming data sources located in disparate VPCs.
For security purposes, Databricks Apache Spark clusters are deployed in an isolated VPC dedicated to Databricks within the customer's account. To run their data workloads, customers need secure connectivity between the Databricks Spark clusters and the data sources above.
It is straightforward for Databricks clusters located within the Databricks VPC to access data from AWS S3, which is not a VPC-specific service. However, we need a different solution to access data from sources deployed in other VPCs, such as AWS Redshift, RDS databases, and streaming data from Kinesis or Kafka. This blog will walk you through the options available to access data from these sources securely, and their cost considerations, for deployments on AWS. To establish a secure connection to these data sources, we will have to configure the Databricks VPC with one of the following two options:
A secure connection between the Databricks cluster and the other non-S3 external data sources can be established by using VPC peering. AWS defines VPC peering as “a networking connection between two VPCs that enables you to route traffic between them using private IPv4 addresses or IPv6 addresses”. For more details, see the AWS documentation.
When the VPC peering option is chosen, one has to take the following factors into consideration:
Here is an example of a situation where the VPC peering option would be ideal: you are tasked with creating a data table that pulls data from a Kafka cluster and stores the aggregated results in an Aurora database, both located in the same VPC external to the Databricks VPC. Assuming no other security limitations, you can establish a VPC peering connection between the Databricks VPC and the external VPC where the data sources are located, and then connect to both sources.
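The peering setup described above can be sketched programmatically with boto3: request the peering connection, accept it, and add a route so cluster traffic bound for the peer VPC's CIDR flows over the link. All VPC, route table, and CIDR values below are hypothetical placeholders for your own deployment.

```python
# Sketch: peering the Databricks VPC with an external data source VPC
# (same account and region assumed). IDs and CIDRs are hypothetical.

def peer_vpcs(ec2, requester_vpc_id, peer_vpc_id):
    """Request and accept a peering connection between two VPCs."""
    resp = ec2.create_vpc_peering_connection(
        VpcId=requester_vpc_id, PeerVpcId=peer_vpc_id
    )
    pcx_id = resp["VpcPeeringConnection"]["VpcPeeringConnectionId"]
    ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)
    return pcx_id

def route_to_peer(ec2, route_table_id, peer_cidr, pcx_id):
    """Route traffic destined for the peer VPC's CIDR over the peering link."""
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=peer_cidr,
        VpcPeeringConnectionId=pcx_id,
    )

if __name__ == "__main__":
    import boto3  # AWS SDK; requires credentials with EC2 permissions
    ec2 = boto3.client("ec2")
    pcx_id = peer_vpcs(ec2, "vpc-0aaa11112222bbbb3",   # Databricks VPC (hypothetical)
                            "vpc-0ccc44445555dddd6")   # external VPC (hypothetical)
    route_to_peer(ec2, "rtb-0eee77778888ffff9",        # hypothetical route table
                  "10.20.0.0/16", pcx_id)              # external VPC CIDR (hypothetical)
```

Remember that a matching route back to the Databricks VPC's CIDR, and security group rules permitting the Kafka and Aurora ports, are also needed on the external side.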
The second option available to connect with the non-S3 data sources is AWS PrivateLink. AWS defines PrivateLink as a service that “provides private connectivity between VPCs, AWS services, and on-premises applications, securely on the Amazon network. AWS PrivateLink simplifies the security of data shared with cloud-based applications by eliminating the exposure of data to the public Internet.”
One has to take the following considerations into account while choosing the PrivateLink option:
Here is an example of when you would use AWS PrivateLink. You have a production VPC with many data sources such as Redshift, Aurora, and MySQL. The business would like to query the data from the MySQL database, but not expose confidential data stored in Redshift or Aurora. Using PrivateLink, you can open a connection from Databricks clusters to MySQL, allowing your users to access MySQL securely while restricting connectivity to Redshift and Aurora.
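The scenario above can be sketched with boto3 in two halves: the production VPC exposes only the MySQL service (fronted by a Network Load Balancer) as an endpoint service, and the Databricks VPC consumes it through an interface endpoint. Redshift and Aurora are simply never exposed. All IDs, ARNs, and names below are hypothetical placeholders.

```python
# Sketch: PrivateLink connectivity to a single service in another VPC.
# Provider side exposes the NLB-fronted MySQL service; consumer side
# creates an interface endpoint in the Databricks VPC.

def endpoint_service_params(nlb_arn):
    """Provider side: parameters for exposing the service over PrivateLink."""
    return {
        "NetworkLoadBalancerArns": [nlb_arn],
        "AcceptanceRequired": True,  # provider approves each connection request
    }

def interface_endpoint_params(vpc_id, service_name, subnet_ids, sg_ids):
    """Consumer side: an interface endpoint that reaches only the exposed
    service, not the rest of the production VPC."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": service_name,
        "SubnetIds": subnet_ids,
        "SecurityGroupIds": sg_ids,
    }

if __name__ == "__main__":
    import boto3  # AWS SDK; requires credentials with EC2 permissions
    ec2 = boto3.client("ec2")
    svc = ec2.create_vpc_endpoint_service_configuration(**endpoint_service_params(
        "arn:aws:elasticloadbalancing:us-west-2:111122223333:"
        "loadbalancer/net/mysql-nlb/abc123"            # hypothetical NLB ARN
    ))
    ec2.create_vpc_endpoint(**interface_endpoint_params(
        "vpc-0aaa11112222bbbb3",                       # Databricks VPC (hypothetical)
        svc["ServiceConfiguration"]["ServiceName"],
        ["subnet-0123456789abcdef0"],                  # hypothetical subnet
        ["sg-0123456789abcdef0"],                      # hypothetical security group
    ))
```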
Manual or programmatic VPC peering: https://docs.databricks.com/administration-guide/cloud-configurations/aws/vpc-peering.html
Manual PrivateLink setup: https://docs.aws.amazon.com/vpc/latest/privatelink/endpoint-services-overview.html
Databricks resources on connecting to data sources
Once a network connection via VPC peering or PrivateLink is established, authentication with the specific data source or service can then be set up. Please see the Databricks on AWS documentation for the specific data sources you need to access. Wherever possible, consider using secrets to keep your connection secure. Using the correct connection option for your needs reduces overall complexity and helps Data Scientists and Data Engineers access the data they need in a secure manner.
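As an illustration of the secrets recommendation, a read from the MySQL source over JDBC might look like the sketch below, with credentials pulled from a Databricks secret scope rather than hard-coded. The scope name "jdbc", the secret keys, and the host/table names are all hypothetical.

```python
# Sketch: building Spark JDBC reader options with credentials kept out of code.
# Scope, keys, host, and table names are hypothetical placeholders.

def mysql_jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's JDBC data source."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }

# In a Databricks notebook, where `spark` and `dbutils` are predefined:
#
# opts = mysql_jdbc_options(
#     host="10.20.1.15",  # private IP reachable over peering or PrivateLink
#     port=3306,
#     database="analytics",
#     table="orders",
#     user=dbutils.secrets.get(scope="jdbc", key="mysql-user"),
#     password=dbutils.secrets.get(scope="jdbc", key="mysql-password"),
# )
# df = spark.read.format("jdbc").options(**opts).load()
```

Secrets fetched this way are redacted in notebook output, so credentials never appear in code or results.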