In this blog, you will learn a series of steps you can take to harden your Databricks deployment from a network security standpoint, reducing the risk of data exfiltration in your organization.
Data exfiltration is every company's worst nightmare, and in some cases, even the largest companies never recover from it. It's one of the last steps in the cyber kill chain, and with maximum penalties under the General Data Protection Regulation (GDPR) of €20 million (~$23 million) or 4% of annual global turnover, it's arguably the most costly.
But first, let's define what data exfiltration is. Data exfiltration, or data extrusion, is a type of security breach that leads to the unauthorized transfer of data. This data often contains sensitive customer information, the loss of which can lead to massive fines, reputational damage, and an irreparable breach of trust. What makes it especially difficult to protect against is that it can be caused by both external and internal actors, and their motives can be either malicious or accidental. It can also be extremely difficult to detect, with organizations often not knowing that it's happened until their data is already in the public domain and their logo is all over the evening news.
There are tons of reasons why preventing data exfiltration is top of mind for organizations across industries. One that we often hear about is concern over platform-as-a-service (PaaS). Over the last few years, more and more companies have seen the benefits of adopting a PaaS model for their enterprise data and analytics needs. Outsourcing the management of your data and analytics service can certainly free up your data engineers and data scientists to deliver even more value to your organization. But if the PaaS provider requires you to store all of your data with them, or processes the data in their network, solving for data exfiltration can become an unmanageable problem. In that scenario, the only assurances you really have are whatever industry-standard compliance certifications they can share with you.
The Databricks Lakehouse platform enables customers to store their sensitive data in their existing AWS account and process it in their own private virtual network(s), all while preserving the PaaS nature of the fastest-growing Data & AI service in the cloud. And now, following the announcement of Private Workspaces with AWS PrivateLink, in conjunction with a cloud-native managed firewall service on AWS, customers can benefit from a new data exfiltration protection architecture, one that's been informed by years of work with the world's most security-conscious customers.
We recommend a hub and spoke topology reference architecture, powered by AWS Transit Gateway. The hub consists of a central inspection and egress virtual private cloud (VPC), while the spoke VPCs house federated Databricks workspaces for different business units or segregated teams. In this way, you create your own version of a centralized deployment model for your egress architecture, as is recommended for large enterprises.
A high-level view of this architecture and the steps required to implement it are provided below:
Databricks enterprise security and admin features allow you to deploy Databricks into your own customer-managed VPC, which gives you greater flexibility and control over the configuration of your spoke architecture. You can also leverage our feature-rich integration with HashiCorp Terraform to create or manage deployments as infrastructure-as-code, so that you can rinse and repeat the operation across the wider organization.
Prior to deploying the workspace, you'll need to create the following prerequisite resources in your AWS account:
Once you've done that, you'll need to register the VPC endpoints for the Databricks backend services by following steps 3-6 of the Enable AWS PrivateLink documentation, before creating a new workspace using the workspace API.
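For illustration, creating the workspace via the Account (workspace) API might look something like the sketch below. The IDs are placeholders for the credential, storage, network and private access settings configurations registered in the earlier steps, and the exact fields and authentication method depend on your account setup, so treat the Databricks documentation as the source of truth.

curl -X POST -u "<account-admin-email>:<password>" \
  "https://accounts.cloud.databricks.com/api/2.0/accounts/<databricks-account-id>/workspaces" \
  -d '{
    "workspace_name": "spoke-workspace",
    "aws_region": "<region>",
    "credentials_id": "<credentials-id>",
    "storage_configuration_id": "<storage-configuration-id>",
    "network_id": "<network-id>",
    "private_access_settings_id": "<private-access-settings-id>"
  }'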
In the example below, a Databricks workspace has been deployed into a spoke VPC with a CIDR range of 10.173.0.0/16 and two subnets in different availability zones with the ranges 10.173.4.0/22 and 10.173.8.0/22. VPC endpoints for the Databricks backend services have also been deployed into a dedicated subnet with a smaller CIDR range, 10.173.12.0/26. You can use these IP ranges to follow the deployment steps and diagrams below.
Over the last decade, there have been many well-publicized data breaches from incorrectly configured cloud storage containers. So, in terms of major threat vectors and mitigating them, there's no better place to start than by setting up your VPC Endpoints.
As well as setting up your VPC endpoints, it's worth considering how these might be locked down further. Amazon S3 has a host of ways you can further protect your data, and we recommend you use these wherever possible.
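For example, a bucket policy can deny any access to one of your own buckets that doesn't arrive via your S3 gateway endpoint. The policy below is only a sketch; the bucket name and endpoint ID are placeholders, and you should test something like this carefully so you don't lock out legitimate access paths (such as the AWS console or other trusted networks).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessUnlessFromVPCEndpoint",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-sensitive-bucket",
        "arn:aws:s3:::my-sensitive-bucket/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:sourceVpce": "<s3-gateway-endpoint-id>"
        }
      }
    }
  ]
}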
In the AWS console:
Service Name | Endpoint Type | Policy | Security Group |
com.amazonaws.<region>.s3 | Gateway | Leave as "Full Access" for now | N/A |
com.amazonaws.<region>.sts | Interface | Leave as "Full Access" for now | The Security Group for the Customer Managed VPC created above |
com.amazonaws.<region>.kinesis-streams | Interface | Leave as "Full Access" for now | The Security Group for the Customer Managed VPC created above |
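If you'd rather script these than click through the console, the equivalent AWS CLI calls look roughly like the sketch below, where the VPC, route table, subnet and security group IDs are placeholders for the resources created above and <region> is your workspace region:

# S3 gateway endpoint, associated with the spoke VPC's route tables
aws ec2 create-vpc-endpoint \
  --vpc-id <spoke-vpc-id> \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.<region>.s3 \
  --route-table-ids <spoke-route-table-id>

# STS and Kinesis interface endpoints in the dedicated endpoint subnet
aws ec2 create-vpc-endpoint \
  --vpc-id <spoke-vpc-id> \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.<region>.sts \
  --subnet-ids <endpoint-subnet-id> \
  --security-group-ids <spoke-vpc-security-group-id> \
  --private-dns-enabled

aws ec2 create-vpc-endpoint \
  --vpc-id <spoke-vpc-id> \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.<region>.kinesis-streams \
  --subnet-ids <endpoint-subnet-id> \
  --security-group-ids <spoke-vpc-security-group-id> \
  --private-dns-enabled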
Note - If you want to add VPC endpoint policies so that users can only access the AWS resources that you specify, please contact your Databricks account team as you will need to add the Databricks AMI and container S3 buckets to the Endpoint Policy for S3.
Please note that using regional VPC endpoints will prevent cross-region access to AWS services, for example S3 buckets in other AWS regions. If cross-region access is required, you will need to allow-list the global AWS endpoints for S3 and STS in the AWS Network Firewall rules below.
For data cataloging and discovery, you can either leverage a managed Hive Metastore running in the Databricks Control Plane, host your own, or use AWS Glue. The steps for setting these up are fully documented in the links below.
Next, you'll create a central inspection/egress VPC, which, once finished, should look like this:
For simplicity, we'll demonstrate the deployment into a single availability zone. For a high availability solution, you would need to replicate this deployment across each availability zone within the same region.
Name | Description | Inbound rules | Outbound rules |
Inspection-Egress-VPC-SG | SG for the Inspection/Egress VPC | Add a new rule for All traffic from 10.173.0.0/16 (the Spoke VPC) | Leave as All traffic to 0.0.0.0/0 |
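A rough CLI equivalent, with the VPC and security group IDs as placeholders:

aws ec2 create-security-group \
  --group-name Inspection-Egress-VPC-SG \
  --description "SG for the Inspection/Egress VPC" \
  --vpc-id <inspection-egress-vpc-id>

# Allow all traffic in from the spoke VPC; outbound defaults to allow all
aws ec2 authorize-security-group-ingress \
  --group-id <inspection-egress-sg-id> \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=10.173.0.0/16}]'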
Because you're going from private to public networks, you will need to add both a NAT Gateway and an Internet Gateway. This helps from a security point of view because the NAT Gateway will sit on the trusted side of the AWS Network Firewall, giving an additional layer of protection (a NAT Gateway not only gives you a single external IP address, it will also refuse unsolicited inbound connections from the internet).
Name | Subnet | Elastic IP allocation ID |
Egress-NGW-1 | NGW-Subnet-1 | Allocate Elastic IP |
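The same steps via the CLI would look something like this (again, the IDs are placeholders):

# Internet gateway for the inspection/egress VPC
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway \
  --internet-gateway-id <igw-id> \
  --vpc-id <inspection-egress-vpc-id>

# Elastic IP plus a NAT gateway in the public NAT subnet
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway \
  --subnet-id <ngw-subnet-1-id> \
  --allocation-id <elastic-ip-allocation-id>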
At the end of this step, your central inspection/egress VPC should look like this:
Now that you've created the networks, it's time to deploy and configure your AWS Network Firewall.
Name | VPC | Firewall subnets | New firewall policy name |
Hub-Inspection-Firewall | Egress-Inspection-VPC | Firewall-Subnet-1 | Egress-Policy |
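If you're scripting the deployment, the CLI equivalent looks roughly like the sketch below; the policy starts out with just its default actions, and the rule groups are attached to it later:

# An initial policy that forwards all traffic to the stateful rule engine
aws network-firewall create-firewall-policy \
  --firewall-policy-name Egress-Policy \
  --firewall-policy '{
    "StatelessDefaultActions": ["aws:forward_to_sfe"],
    "StatelessFragmentDefaultActions": ["aws:forward_to_sfe"]
  }'

aws network-firewall create-firewall \
  --firewall-name Hub-Inspection-Firewall \
  --firewall-policy-arn <egress-policy-arn> \
  --vpc-id <inspection-egress-vpc-id> \
  --subnet-mappings SubnetId=<firewall-subnet-1-id>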
To configure your firewall rules, you're going to use the AWS CLI. The reason for this is that for AWS Network Firewall to work in a hub & spoke model, you need to provide it with a HOME_NET variable - that is, the CIDR ranges of the networks you want to protect. Currently, this is only configurable via the CLI.
aws network-firewall list-firewalls
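The create-rule-group command below expects the rule definitions in a JSON file. Here's a minimal sketch of what allow-list-fqdns.json might contain; the HOME_NET override covers the spoke CIDR range used in this example (add the CIDR ranges of any other networks you want to protect), and the domains are purely illustrative, so replace them with the FQDNs required for your Databricks region and data sources:

{
  "RuleVariables": {
    "IPSets": {
      "HOME_NET": {
        "Definition": ["10.173.0.0/16"]
      }
    }
  },
  "RulesSource": {
    "RulesSourceList": {
      "Targets": [
        ".cloud.databricks.com",
        ".rds.amazonaws.com"
      ],
      "TargetTypes": ["TLS_SNI", "HTTP_HOST"],
      "GeneratedRulesType": "ALLOWLIST"
    }
  }
}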
aws network-firewall create-rule-group --rule-group-name Databricks-FQDNs --rule-group file://allow-list-fqdns.json --type STATEFUL --capacity 100
Finally, add some basic deny rules to cater for common firewall scenarios such as preventing the use of protocols like SSH/SFTP, FTP and ICMP. Create another JSON file, this time called "deny-list.json." An example of a valid rule group configuration would be as follows:
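(A minimal sketch, assuming the stateful rules schema accepted by create-rule-group; the rule IDs are arbitrary, and you would add similar entries for any other protocols you want to block.)

{
  "RulesSource": {
    "StatefulRules": [
      {
        "Action": "DROP",
        "Header": {"Protocol": "SSH", "Source": "ANY", "SourcePort": "ANY", "Direction": "ANY", "Destination": "ANY", "DestinationPort": "ANY"},
        "RuleOptions": [{"Keyword": "sid", "Settings": ["1"]}]
      },
      {
        "Action": "DROP",
        "Header": {"Protocol": "FTP", "Source": "ANY", "SourcePort": "ANY", "Direction": "ANY", "Destination": "ANY", "DestinationPort": "ANY"},
        "RuleOptions": [{"Keyword": "sid", "Settings": ["2"]}]
      },
      {
        "Action": "DROP",
        "Header": {"Protocol": "ICMP", "Source": "ANY", "SourcePort": "ANY", "Direction": "ANY", "Destination": "ANY", "DestinationPort": "ANY"},
        "RuleOptions": [{"Keyword": "sid", "Settings": ["3"]}]
      }
    ]
  }
}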
aws network-firewall create-rule-group --rule-group-name Deny-Protocols --rule-group file://deny-list.json --type STATEFUL --capacity 100
Now add the two rule groups you just created (Databricks-FQDNs and Deny-Protocols) to the Egress-Policy created above.
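With the CLI, this is an update-firewall-policy call; you'll need the policy's current update token (from the describe call) and the ARNs of the two rule groups, shown as placeholders here:

aws network-firewall describe-firewall-policy --firewall-policy-name Egress-Policy

aws network-firewall update-firewall-policy \
  --firewall-policy-name Egress-Policy \
  --update-token <update-token-from-describe-call> \
  --firewall-policy '{
    "StatelessDefaultActions": ["aws:forward_to_sfe"],
    "StatelessFragmentDefaultActions": ["aws:forward_to_sfe"],
    "StatefulRuleGroupReferences": [
      {"ResourceArn": "<databricks-fqdns-rule-group-arn>"},
      {"ResourceArn": "<deny-protocols-rule-group-arn>"}
    ]
  }'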
Your AWS Network Firewall is now deployed and configured; all you need to do now is route traffic to it.
These steps walk through creating a firewall configuration that restricts outbound HTTP/S traffic to an approved set of Fully Qualified Domain Names (FQDNs). So far, this blog has focused a lot on this last line of defense, but it's also worth taking a step back and considering the multi-layered approach taken here. For example, the security group for the spoke VPC only allows outbound traffic. Nothing can access this VPC unless it is in response to a request that originates from that VPC. This approach is enabled by the Secure Cluster Connectivity feature offered by Databricks and allows us to protect resources from the inside out.
At the end of this step, your central inspection/egress VPC should look like this:
Now that your spoke and inspection/egress VPCs are ready to go, all you need to do is link them together, and AWS Transit Gateway is the perfect solution for that.
First, let's create a Transit Gateway and link our Databricks data plane via TGW subnets:
Transit Gateway ID | Attachment type | Attachment name tag | VPC ID | Subnet IDs |
Hub-TGW | VPC | Spoke-VPC-Attachment | Customer Managed VPC created above | TGW-Subnet-1 and TGW-Subnet-2 created above |
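A CLI sketch of the same steps, with the IDs as placeholders:

aws ec2 create-transit-gateway --description "Hub-TGW"

aws ec2 create-transit-gateway-vpc-attachment \
  --transit-gateway-id <hub-tgw-id> \
  --vpc-id <spoke-vpc-id> \
  --subnet-ids <tgw-subnet-1-id> <tgw-subnet-2-id> \
  --tag-specifications 'ResourceType=transit-gateway-attachment,Tags=[{Key=Name,Value=Spoke-VPC-Attachment}]'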
Repeat the process to create Transit Gateway attachments for the TGW to Inspection/Egress-VPC:
Transit Gateway ID | Attachment type | Attachment name tag | VPC ID | Subnet IDs |
Hub-TGW | VPC | Inspection-Egress-VPC-Attachment | Inspection-Egress-VPC | TGW-Subnet-1 |
All of the logic that determines what routes where via a Transit Gateway is encapsulated within Transit Gateway Route Tables. Next, we're going to create some TGW route tables for our hub & spoke networks.
Now associate these route tables and, just as importantly, create some routes:
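Via the CLI, that looks roughly like the sketch below, shown for the spoke side; repeat the equivalent association and routes for the inspection/egress side:

# Route table for traffic arriving from the spoke VPC
aws ec2 create-transit-gateway-route-table --transit-gateway-id <hub-tgw-id>

aws ec2 associate-transit-gateway-route-table \
  --transit-gateway-route-table-id <spoke-tgw-route-table-id> \
  --transit-gateway-attachment-id <spoke-vpc-attachment-id>

# Send everything from the spoke out via the inspection/egress VPC attachment
aws ec2 create-transit-gateway-route \
  --destination-cidr-block 0.0.0.0/0 \
  --transit-gateway-route-table-id <spoke-tgw-route-table-id> \
  --transit-gateway-attachment-id <inspection-egress-vpc-attachment-id>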
The Transit Gateway should now be set up and ready to go; all that's left is to update the route tables in each of the subnets so that traffic flows through it.
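For example, the default route in each spoke workspace subnet should now point at the Transit Gateway, while the TGW subnet in the inspection/egress VPC points at the firewall endpoint, and so on down the chain to the NAT and Internet Gateways. A sketch, with the route table and endpoint IDs as placeholders:

# Spoke workspace subnets: send all outbound traffic to the Transit Gateway
aws ec2 create-route \
  --route-table-id <spoke-subnet-route-table-id> \
  --destination-cidr-block 0.0.0.0/0 \
  --transit-gateway-id <hub-tgw-id>

# Inspection/egress TGW subnet: send outbound traffic to the firewall endpoint
aws ec2 create-route \
  --route-table-id <egress-tgw-subnet-route-table-id> \
  --destination-cidr-block 0.0.0.0/0 \
  --vpc-endpoint-id <firewall-vpce-id>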
To ensure there are no errors, we recommend some thorough testing before handing the environment over to end-users.
First you need to create a cluster. If that works, you can be confident that your connection to the Databricks secure cluster connectivity relay works as expected.
Next, check out Get started as a Databricks Workspace user, particularly the Explore the Quickstart Tutorial notebook, as this is a great way to test the connectivity to a number of different sources, from the Hive Metastore to S3.
As an additional test, you could use %sh in a notebook to invoke curl and test connectivity to each of the required URLs.
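For example, something like the following, where the workspace URL is a placeholder; a timeout or connection failure on the last two requests is exactly what you want to see:

%sh
# Allowed: your workspace's web application URL
curl -sv --max-time 10 https://<deployment-name>.cloud.databricks.com > /dev/null
# Should be blocked by the FQDN allow list
curl -sv --max-time 10 https://www.google.com > /dev/null
# Should also fail, since the global S3 endpoint is not allow-listed
curl -sv --max-time 10 https://s3.amazonaws.com > /dev/null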
Now go to Firewalls in the AWS console and select the Hub-Inspection-Firewall you created above. On the Monitoring tab you should see the traffic generated above being routed through the firewall:
If you want a more granular level of detail, you can set up specific logging and monitoring configurations for your firewall, sending information about the network traffic flowing through it and any actions applied to it to destinations such as Amazon S3 or CloudWatch Logs. What's more, by combining these with Databricks audit logs, you can build a 360-degree view of exactly how users are using their Databricks environment, and set up alerts on any potential breaches of the acceptable use policy.
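As a sketch, enabling alert logs to an S3 bucket via the CLI might look like this (the bucket name is a placeholder, and flow logs can be enabled the same way with a second LogDestinationConfig entry):

aws network-firewall update-logging-configuration \
  --firewall-name Hub-Inspection-Firewall \
  --logging-configuration '{
    "LogDestinationConfigs": [
      {
        "LogType": "ALERT",
        "LogDestinationType": "S3",
        "LogDestination": {
          "bucketName": "<my-firewall-logs-bucket>",
          "prefix": "egress-firewall"
        }
      }
    ]
  }'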
As well as positive testing, we recommend doing some negative tests of the firewall too. For example:
HTTPS requests to the Databricks Web App URL are allowed.
Whereas HTTPS requests to google.com fail.
Finally, it's worth testing the "doomsday scenario" as far as data exfiltration protection is concerned: that data could be leaked to an S3 bucket outside of your account. Since the global S3 URL has not been allow-listed, attempts to connect to S3 buckets outside of your region will fail:
And if you combine this with endpoint policies for Amazon S3, you can tightly enforce which S3 buckets a user can access from Databricks within your region too.
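As a sketch, an endpoint policy that scopes the S3 gateway endpoint to an approved set of buckets might look like the example below. The bucket names are placeholders, and remember the note earlier: the Databricks AMI and container S3 buckets also need to be included, so work with your Databricks account team before applying anything like this.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowApprovedBucketsOnly",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-approved-bucket",
        "arn:aws:s3:::my-approved-bucket/*"
      ]
    }
  ]
}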
Depending on how you set up the Customer Managed VPC, you might find that there are now some unused resources in it, namely:
Once you have completed your testing, it should be safe to detach and delete these resources. Before you do, it's worth double-checking that your traffic is routing through the AWS Network Firewall as expected, and not via the default NAT Gateway. You can do this in any of the following ways:
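For example, you could confirm that the default route in each spoke workspace subnet points at the Transit Gateway rather than a NAT Gateway; a quick check with the CLI (the subnet ID is a placeholder):

aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=<spoke-subnet-id>" \
  --query "RouteTables[].Routes[?DestinationCidrBlock=='0.0.0.0/0']"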
If your Databricks workspace continues to function as expected (for example you can start clusters and run notebook commands), you can be confident that everything is working correctly. In the event of a configuration error, you might see one of these issues:
# | Issue | Things to check |
1 | Cluster creation fails after a few minutes with an error saying that it has Failed Fast | |
2 | Cluster creation takes a long time and eventually fails with Container Launch Failure | |
3 | Cluster creation takes a long time and eventually times out with Network Configuration Failure | |
If clusters won't start, and more in-depth troubleshooting is required, you could create a test EC2 instance in one of your Customer Managed VPC subnets and use commands like curl to test network connectivity to the necessary URLs.
You can't put a price on data security. The direct cost of exfiltrated data is often just the tip of the iceberg, compounded by the cost of long-term reputational damage, regulatory backlash, loss of IP, and more.
This blog shows an example firewall configuration and how security teams can use it to restrict outbound traffic based on a set of allowed FQDNs. It's important to note, however, that a one-size-fits-all approach will not work for every organization; the right controls depend on your risk profile and the sensitivity of your data. There's plenty that can be done to lock this down further. As an example, this blog has focused on how to prevent data exfiltration from the data plane, which is where the vast majority of the data is processed and resides. But you could equally implement an architecture involving front-end (user to workspace) AWS PrivateLink connections to restrict access to locked-down VMs or Amazon WorkSpaces, thereby helping to mitigate any risk associated with the subsets of data that are returned to the Control Plane.
Customers should always engage the right security and risk professionals in their organizations to determine the appropriate access controls for each individual use case. This guide should be seen as a starting point, not the finishing line.
The war against cybercriminals and the many cyber threats faced in this connected, data-driven world is never won, but there are step-wise approaches like protecting against data exfiltration that you can take to fortify your defense.
This blog has focused on how to prevent data exfiltration with an extra-secure architecture on AWS. But the best security is always based on a defense-in-depth approach. Learn more about the other platform features you can leverage in Databricks to protect your intellectual property, data and models. And learn how other customers are using Databricks to transform their business, and better still, how you can too!