Databricks Lakehouse Platform Governance and Security Fundamentals

May 27, 2021 11:00 AM (PT)

Attacks on enterprise data can come from employees with access to company systems or from external private or state-sponsored malicious actors. Some of the larger well-known data breaches were planned and executed over months and years of preparation, and in all cases the victims were unaware until it was too late and the damage was done. Any comprehensive solution an enterprise adopts to mitigate this risk has to address all four areas of people, process, policy, and platform. Most organizations spend a lot of time managing people, policies, and processes. But what happens when you start with your data platform, the core of your entire data architecture, and work your way out? In this session, learn the fundamentals of governance and security for your cloud data and analytics platform, including extending cloud identity management, setting up private links, monitoring access and costs, and ensuring the right policies are enforced for every workspace.

In this session watch:
Abhinav Garg, Product Manager, Databricks
Tianyi Huang, Engineering Manager, Databricks


Transcript

Abhinav Garg: Hi, I’m Abhinav Garg, Product Manager at Databricks, and I have Tianyi Huang with me, who is an Engineering Manager at Databricks. We partner to deliver platform security and infrastructure capabilities for the Databricks Lakehouse Platform. As you have seen, the top companies in the world by market cap are inherently data companies, software companies that have built big businesses by using data. And talking to global enterprises, small to large, our point of view is that the biggest companies in the future are all going to be data companies, in every market segment and every industry vertical, at least under the hood. To allow all their users to access all their data in the right way while conforming to their enterprise governance policies, we have built many capabilities into the platform. So we are going to focus on a lot of those capabilities and what the focus themes are.
Before we dive in into those capabilities, I just want to do a refresh of what the Databricks Lakehouse Platform is. So what we have developed is a simple and open Lakehouse Platform where organizations can allow all their data personas, data users to come in and collaborate to create data products. These can be data engineers, data scientists, data analysts, business analysts, and ML engineers, they can all use those different products that are available as part of the Lakehouse Platform in terms of data engineering, BI and SQL analytics, real-time applications, data science and machine learning to create their data products.
And under the hood, there is a data management and governance layer using Delta Lake, which provides reliability, performance, and governance over how the data is accessed by all these users. That sits on top of the open data lake, which is built on open standards. And all of these different products and layers sit on top of the platform infrastructure layer, which at the bottom sits on top of the cloud provider infrastructure. Our product is a multicloud product that’s available on the three major public clouds, AWS, Azure, and GCP. And that’s where we have baked in the security capabilities that propagate through all the different layers. So our focus is to make that platform infrastructure layer simple, secure, and scalable, and we are going to focus on the security aspects of that platform layer and how those different capabilities propagate through the different layers of the stack.
So before we go into the specific capabilities, I just want to highlight the must-haves and should-haves of platform governance. This is based on all the analysis that we have done over the years, working with our global customers across industry verticals like financial services, healthcare, retail, adtech, and many more: what are the must-haves and should-haves of platform governance so that they can onboard their hundreds of thousands of users onto the Databricks Lakehouse Platform.
So who cares about these platform governance needs that we’re going to talk about? These are the different personas that we talk to day in and day out who care about platform governance capabilities, and it’s definitely not restricted to them. In an organization dealing with data and building data products, every persona, including data engineers, data scientists, and data analysts, should care about governance. But the really prominent stakeholders are the executive leadership, who are answerable to their investors and to the other prominent stakeholders of the company to make sure that goodwill is maintained and that there’s no loss or leakage of data. So executive leadership are prominent stakeholders here.
Then there is the data platform leadership, because it percolates from executive leadership down to them: they are provisioning the platform, maintaining it, and managing it for all the users in the company, and they are onboarding those users. So it is basically their primary job to make sure that the platform they’re adopting for all their data products is secure and governed, and that there are capabilities that allow them to do this easily.
And what sort of governance controls have to be complied with? That really comes from three different departments, and different organizations have different structures, so these could vary. There’s infosec, sometimes just called security engineering, or the compliance team; there’s cloud architecture and infrastructure, and in many large organizations there is an architecture review board or an enterprise architecture team of which cloud architecture is a part, so they definitely drive a lot of governance controls from a cloud perspective.
And then, because this often involves sensitive data, legal and IT risk are also involved. All of those departments across those different orgs enforce different guardrails and controls that the data platform leadership and the data platform team have to make sure the platform conforms to. So those are the different stakeholders, but as I mentioned, really everyone using the data and the data Lakehouse Platform, like Databricks, should care about overall governance.
Now, the first theme is data access control: who can access what? If there are different tables on cloud storage or in a cloud database, who can access those tables, and during what times? There might be a particular table that should be accessible only for a particular week and not beyond that, and only certain columns in that table. So there’s a whole aspect of access control to be set up, and organizations are onboarding hundreds of users. We are going to focus on this theme later, but this is the first thing that comes up when we talk about platform governance and security with those stakeholders.
Cost control and chargeback. Controlling costs and attributing those costs to different business teams, project teams, and users is a big topic whenever we talk to these stakeholders. How can they minimize cost and track which team is consuming how much? We have built capabilities in our platform that address those requirements, and we are going to talk about them in a bit.
The third major theme that comes up is perimeter security. So yes, there are data access controls and admins should be able to control who is able to access what table or what files, but then there is an overall network perimeter that the cloud architecture, cloud infrastructure, network architecture folks in an organization are responsible for. They have to make sure that the platform is not able to access any unauthorized data source. So perimeter security is super important especially when it comes to cloud. So we have built great capabilities in the product for organizations to onboard. So we are going to talk about that as well. I’m going to hand it over to Tianyi to talk about the next three major themes, and then we are going to dive right into other features. Tianyi.

Tianyi Huang: All right, thank you, Abhinav. So another theme is data encryption. The question is whether I can encrypt my data using my own keys. That provides an extra layer of control to protect against the risk of data breach, especially for regulated industries that process sensitive or confidential data, because I can just revoke the key, and that revokes any access to the encrypted data in case of emergency.
The next dimension is auditability, which is: can I track who is doing what at what time? That is also a common ask for legal and security compliance reasons, as we need evidence of user actions both for proactive analysis and alerting and for legal coverage when there is an incident like a data leak.
And last but not least, for certain industry verticals like healthcare, financial services, or government agencies, there are often compliance requirements and certifications that the data platform needs to satisfy in order to process sensitive data. The question then is: do you have the necessary stamps? Otherwise, they simply cannot adopt the service. And that is the last dimension at the high level. I will then hand it back to Abhinav for the deep dive.

Abhinav Garg: Cool, thanks, Tianyi. All right, so now we have taken a look at those major themes, essentially the must-haves and should-haves, under which the different controls and aspects fall from a governance perspective. We are going to touch upon each of those themes and the capabilities that we have built into our platform. They all work well with each other in an integrated fashion and are available to our users on all three major public clouds.
So starting out with the data access control part, who can access what? It all starts with identity and access management. Now, one might think, what does this have to do with IAM? To actually start applying controls on who can access what in a data Lakehouse Platform, you have to make sure that the right identities are provisioned in the platform, and that users are able to access the platform securely and in a fashion that complies with the organization’s governance policies.
So admins, or the Databricks admins, can configure single sign-on with their IdP once per Databricks account, which is a multicloud account. Then they can sync their users and groups across different workspaces, and these can be multicloud workspaces, so an AWS, Azure, or GCP workspace. Based on what a user is entitled to, they can sync them to the appropriate workspaces. And then those users, by way of their direct entitlements or their group entitlements, can collaborate with each other in building data products and sharing data products with each other within the organization. This works at a multicloud level.
Finally, there’s the aspect that users work interactively in the platform, querying data, building data products, doing machine learning, creating machine learning models, and scoring them, but a lot of this has to be done in an automated fashion to build scalable, reliable, and consistent data products. And how should those automated workflows run? They cannot run with user principals, which is why we have the notion of service principals, or service accounts, or what different people call system identities, in the platform. Those are available, again, on our multicloud product to run those automated workloads.
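As a rough sketch of how that provisioning can be automated, the snippet below creates a service principal through the workspace SCIM API; the workspace URL, token, display name, and the exact SCIM schema string are illustrative assumptions rather than an exact recipe.

```python
# A hedged sketch: provisioning a service principal via the SCIM API.
# The workspace URL, token, and schema URN below are illustrative assumptions.
import requests

HOST = "https://dbc-example-1234.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder admin token

resp = requests.post(
    f"{HOST}/api/2.0/preview/scim/v2/ServicePrincipals",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "displayName": "nightly-etl-runner",             # hypothetical automation identity
        "entitlements": [{"value": "allow-cluster-create"}],
    },
)
resp.raise_for_status()
print(resp.json().get("applicationId"))
```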
The second layer of control in terms of data access comes in the form of the managed catalog, as you will have seen during the managed catalog session. If you have not seen it, I really encourage you to attend the managed catalog session and the different sessions related to it. The idea is that there’s a single pane of glass which controls who can access what table and who can access what file. Today in the platform, we have a control called table access controls, with which admins can grant permissions on tables to different users. And then there is a second capability called user identity passthrough, with which users can use their own cloud-native identity to access the data and the files.
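To make that concrete, here is a small sketch of today’s table access controls as run from a notebook on a cluster with table access control enabled; the database, table, and group names are hypothetical.

```python
# A sketch of SQL-based table access controls (table ACLs); object names are hypothetical.
spark.sql("GRANT USAGE ON DATABASE sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE sales.transactions TO `analysts`")
spark.sql("GRANT SELECT, MODIFY ON TABLE sales.transactions TO `data-engineers`")

# Review who holds which privileges on the table.
spark.sql("SHOW GRANT ON TABLE sales.transactions").show(truncate=False)
```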
By building a managed catalog offering in the product, we are streamlining and aggregating everything into that single pane of glass, where there is SQL-based access control for all table-based accesses, whether those are managed tables or external tables on cloud storage, as well as identity-passthrough-based access to the files. All of it will be enforced by the managed catalog. And it’s not just an enforcement layer or an access control layer: the managed catalog has many cool bits to it in terms of data discovery and attribute-based access control by way of policies. So all those cool capabilities are being made available via the managed catalog offering.
The overall data security end goal of the managed catalog is to offer fine-grained security directly over the data lake storage and provide richer controls using the SQL standard, and it works with all the languages that are available in the data Lakehouse Platform, naturally without any API restrictions. There are column-, row-, and attribute-based access controls. What are attribute-based access controls? Think about it this way: every user or group has some property; it may be an organizational unit that they’re a part of, or a particular tag that’s unique to their project. Admins can create policies based on those properties, and as new users are added to those roles or to the platform, those policies automatically apply to them.
So it makes way for a very scalable and consistent application of the access control structure across different users and groups. And with all of these access controls, the admins, the security folks, and the infrastructure folks are interested in seeing, “Okay, who actually has that access level? When was it set up? Who is accessing these different tables?” So there is reliable audit logging as part of that security end goal. And fourth, we want our partners’ third-party security products to be able to integrate with the managed catalog seamlessly in providing those different capabilities, and it already works really well with those products. And the idea is that one should be able to share a managed catalog across multiple workspaces, and these can be multicloud workspaces across AWS, Azure, and GCP.
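One way to get the column- and row-level behavior described above with today’s building blocks is a dynamic view over the table, using the is_member() function; the table, view, and group names here are hypothetical.

```python
# A sketch of fine-grained control with a dynamic view; object names are hypothetical.
spark.sql("""
  CREATE OR REPLACE VIEW sales.transactions_redacted AS
  SELECT
    order_id,
    region,
    amount,
    -- Only members of the `finance` group see the raw card number.
    CASE WHEN is_member('finance') THEN card_number ELSE 'REDACTED' END AS card_number
  FROM sales.transactions
  -- Non-admins only see rows for their own region.
  WHERE is_member('admins') OR region = 'EMEA'
""")

# Expose the view, not the underlying table, to the broader group.
spark.sql("GRANT SELECT ON VIEW sales.transactions_redacted TO `analysts`")
```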
Now that we have looked at data access control, I’m going to talk about the capabilities under the theme of cost control and chargeback. Again, this relates to how to minimize costs in general and how teams and projects can attribute costs. If you look at this diagram, from a cost management and governance perspective, this is how the flow works within the product and these are the different capabilities. At a high level, admins can create pools of VMs which are available for different clusters to be created from. Think of these as warm nodes, so that clusters can come up super fast, in seconds. Admins can then create policies which are based on those pools.
A policy can indicate that this is an ETL policy, a data science policy, or a SQL policy. Admins can configure budgets and tags at the pool level, and by way of policies, those budgets and tags are applied to all the clusters that emanate from a policy. When users create clusters, they create them by following a particular policy that they’re entitled to, so the budget and tags defined at the pool level automatically propagate to the clusters. And from a reporting standpoint, what we do from a platform perspective is propagate all those tags and budget information into the usage logging and usage monitoring levels.
So there are in-product capabilities: admins can go into the account console, the multicloud account console, and they can see who is using what. They can slice and dice and do a full analysis in terms of which team is using what, by what workload type, by what SKU, by what workspace, et cetera. So there’s that in-product analysis that they can do. They can also download that usage data to their machine and run their own tools on top of those downloads, or they can configure delivery of that data to cloud storage and do automated analysis based on that.
Many of our advanced, large customers, because they want to have a holistic view across all their platforms, cloud storage, and the data Lakehouse Platform, configure that usage log delivery so that they can run their visualization and querying tools across all of that usage data. So that’s the overall cost management and governance workflow, from pools to clusters via enforced policies, and then to the usage monitoring view.
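As a rough sketch of the kind of analysis customers run over delivered usage logs, the PySpark snippet below aggregates DBUs by SKU and by a custom tag; the delivery path and the column names (sku, dbus, clusterCustomTags) are assumptions about the export format.

```python
# A sketch for a Databricks notebook; the path and column names are assumptions.
from pyspark.sql import functions as F

usage = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("s3://acme-usage-logs/billable-usage/csv/")  # hypothetical delivery location
)

# Total DBUs per SKU, e.g. jobs compute versus all-purpose compute.
usage.groupBy("sku").agg(F.sum("dbus").alias("total_dbus")).show()

# Chargeback by team, assuming a `team` key inside the clusterCustomTags JSON column.
by_team = (
    usage.withColumn("team", F.get_json_object("clusterCustomTags", "$.team"))
         .groupBy("team")
         .agg(F.sum("dbus").alias("total_dbus"))
)
by_team.orderBy(F.desc("total_dbus")).show()
```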
Now, cluster policies are a major part of that cost management and governance model that I showed on the last slide. Cluster policies are the mechanism through which admins can enforce or suggest configurations when a cluster is created. Admins can add tags at the policy level to make sure that any clusters created from that policy have those tags, or they can add tags at the pool level and then configure that pool in the policy, so that any clusters created from that policy emanate from that pool and therefore carry those tags from the pool.
That’s a way to enforce that all clusters being created have the required tags at the project, team, or business unit level, so that there can be appropriate chargeback and attribution of cost. And this attribution is not just at the Databricks level; it’s at the cloud storage and compute level also, because we propagate these tags to the compute resources. Cluster policies can also be used to enforce security and compliance, stability, and supportability requirements.
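Here is a small sketch of what such a policy definition can look like, pinning clusters to a pool, requiring a team tag, and capping auto-termination; the pool ID, tag value, workspace URL, and token are placeholders.

```python
# A sketch of a cluster policy that pins a pool, forces a cost tag, and caps idle time.
# Pool ID, tag value, host, and token are placeholders.
import json
import requests

policy_definition = {
    "instance_pool_id": {"type": "fixed", "value": "pool-0123456789abcdef"},
    "custom_tags.team": {"type": "fixed", "value": "data-engineering"},
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
}

resp = requests.post(
    "https://dbc-example-1234.cloud.databricks.com/api/2.0/policies/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"name": "etl-policy", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()
print(resp.json())  # returns the new policy_id
```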
As I mentioned during the cost management and governance discussion, once tags and budgets have been enforced at the pool, cluster, or workspace level, and users have run their workloads over a period of time, admins can view the actual usage via the multicloud account console. This is just one example, where they can slice and dice over a time period, look at usage by SKU, by workspace, and by other properties, and view it in tabular form or as charts, as shown here.
The idea is to make it as simple as possible for admins to view high-level data here, but they can very well download that data to do detailed cost chargeback and attribution analysis, or, as I mentioned, configure delivery of that usage data to their cloud storage too. So that’s the overall usage logs capability. Now I’m going to hand it over to Tianyi, who is going to talk about perimeter security capabilities and how they apply to users.

Tianyi Huang: All right, thank you, Abhinav. So in terms of perimeter security, it’s about how the Databricks platform provides a highly secure networking infrastructure and advanced customization options to reduce the surface area of your Databricks workspace and thus improve security. I will focus on three main capabilities here. The first one is the customer-managed network. By default, Databricks creates and manages the network infrastructure that hosts the clusters in the customer’s account. But you can optionally create and manage your own network, which allows you to exercise more control over the infrastructure and thus helps you comply with the cloud security and governance standards your organization may require.
For example, your org might have policies that prevent other service providers from creating or deleting networks in your cloud account. With the customer-managed network capability, you can then just share fewer privileges with Databricks to be compliant. In addition, you get the flexibility to set up more advanced security configurations for your own network; for example, you can use egress firewall rules or a proxy appliance to limit outbound traffic and allowlist your data sources.
And there are even more benefits: since you can deploy the workspace into the same network as your data sources or other data applications you have, data doesn’t need to transfer outside of the network. You get a cost savings benefit, because cloud vendors often charge less for in-network transfer, as well as a security benefit, because data won’t be leaving the network at all. So that’s the customer-managed network capability.
The next one is secure cluster connectivity, from your cloud account that hosts the clusters to the Databricks backend services, or what we call the control plane. It ensures that there is only outbound traffic from your clusters, so no open inbound ports or public IPs need to be exposed. How it works is that at cluster creation time, the cluster initiates a connection to the control plane relay, which establishes a reverse tunnel so that the cluster will listen for any commands sent by the control plane. It uses HTTPS and a different IP address than the one used for the public API.
For any following operations that the control plane logically initiates, such as starting new jobs or upsizing the cluster, the requests are sent through the aforementioned reverse tunnel. Therefore, we don’t need to open any inbound ports or public IP addresses in the cluster’s network. With this architecture there’s also no need for port configuration, firewall rules, or network peering, so it also makes network administration easier.
The next main capability I want to talk about goes even one step further: private workspaces using Private Link. What it means is that all Databricks traffic is restricted to the cloud vendor’s backbone network and your own networks; that is, there’s no exposure of your workspace to the public internet at all. We also provide an extra layer of authorization so that only your traffic is allowed into your workspace. By default, a workspace can be accessed from the public internet; for example, one might want to work on their notebooks in a coffee shop, but that increases the risk of data hacks or data exfiltration.
So private workspaces allow you to plug any gaps that could become vectors for such attacks. In particular, there are three different layers of private connectivity. The first layer is private connectivity to the front-end interface of your workspace, so you can ensure that user or client traffic to notebooks, SQL endpoints, and the REST API all transits over the private network.
The second layer is private connectivity to the backend interface: that ensures that all the cluster traffic between your cloud account and the Databricks control plane also transits over private networks. And the third layer is private connectivity to the data sources from your clusters, so you can configure that layer to communicate privately as well and get full private coverage for your workspace. With these private connectivities in place, none of your workspace communication goes over the public internet at all. So that’s the perimeter security dimension. Next, I would like to cover data encryption.
Enterprises usually have risk management processes that require protection against any potential data breach, especially in regulated industries or sectors that process personal data, health data, or other confidential information. So data encryption using their own keys, and the key management capability, are the must-haves here. With the Databricks platform, at a high level we support encryption for two categories of data. The first is data stored in the control plane and managed by Databricks, for example workspace notebooks, secrets, SQL Analytics queries, and more. For data stored within Databricks, using this customer-managed key capability you get full control over the keys used to encrypt it. In case of an emergency, you can revoke the key and thereby revoke any subsequent access to that data. And the operations are auditable to make administration easy, as you can track and get visibility into the relevant key operations.
Here’s how it works. There are multiple steps. First, you provide Databricks a key, which we call a customer-managed key, using the cloud service’s key management system. On the Databricks side, we also create and manage a key for each workspace. Then, combining the customer-managed key and the Databricks-managed key, we derive a data encryption key, which is what is actually used to encrypt data like the notebooks. The data encryption key is also cached in memory for a certain amount of time. If you delete or revoke your key, all subsequent reads and writes of notebooks or other data will fail at the end of the cache interval, because the data encryption key will no longer be accessible after that. You can also rotate your key, either routinely or as a one-off change, as a cryptographic best practice, and you can do that without affecting reads or writes of the existing data, as the platform will handle the data encryption for you. So that’s managed services data encryption.
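To illustrate the idea (and only the idea; this is not Databricks’ actual implementation), here is a conceptual sketch of deriving a data encryption key from a customer-managed key plus a platform-managed key, and why revoking the customer key cuts off access once the cached key expires.

```python
# Conceptual sketch only, not the real implementation: derive a DEK from two key
# materials and use it to encrypt data, so losing either input key blocks re-derivation.
import base64
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.fernet import Fernet

customer_managed_key = os.urandom(32)    # stands in for the key held in your cloud KMS
databricks_managed_key = os.urandom(32)  # stands in for the per-workspace platform key

def derive_dek(cmk: bytes, dmk: bytes) -> bytes:
    """Derive a data encryption key (DEK) from both key materials via HKDF."""
    hkdf = HKDF(algorithm=hashes.SHA256(), length=32, salt=dmk, info=b"workspace-dek")
    return base64.urlsafe_b64encode(hkdf.derive(cmk))

dek = derive_dek(customer_managed_key, databricks_managed_key)
ciphertext = Fernet(dek).encrypt(b"notebook contents")

# If the customer revokes their key, the DEK cannot be re-derived once the cached
# copy expires, so reads and writes of the encrypted data start failing.
del customer_managed_key
```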
The second high-level category of data is data stored in the data plane, in your account, but also managed by Databricks. We call it workspace storage: for example, the DBFS storage or any attached cluster volumes or disks. Let’s take DBFS as an example here. It is a distributed file system mounted to a workspace and available to all clusters, and under the covers it is actually implemented using a cloud object store.
With the customer-managed key capability, we integrate with both the cloud key management system and the cloud storage to encrypt the DBFS data using your own keys. You then get the same level of key control as for the control plane data we just mentioned, so for both key rotation and revocation, you can take care of those operations yourself. That’s the data encryption piece, and I’ll hand it back to Abhinav for the next theme.

Abhinav Garg: Thanks, Tianyi. So there are a couple of things remaining which are supercritical for the security, legal, and risk officers. The first one is auditability: can we actually track who’s doing what in the platform? Which users are admins onboarding? What roles are they giving to those users? And many other things that those users are doing in the platform. To that end, we have two awesome capabilities in the platform across the multicloud product.
There are low-latency audit logs, where we ship the audit logs to the customer’s cloud storage, guaranteed to be delivered every five minutes; we call these near real-time or low-latency audit logs. Customers can then ship those logs from their cloud-native storage to their SIEM tools if they want to, analyze those audit logs, and create alerts on top of them.
Using this capability, customers can do both reactive and proactive analysis of the events that they are interested in. For example, they can monitor who is logging in, and who is getting unauthorized errors while trying to log into the workspace. Who is creating clusters and who is terminating clusters? Who is getting what level of access to a cluster? What jobs are being created in the platform, and who is creating those jobs?
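As a rough sketch of that kind of analysis over delivered logs, the notebook snippet below reads the JSON audit logs from cloud storage and pulls out failed logins and cluster lifecycle actions; the bucket path is a placeholder, and the field and action names are assumptions about the audit log schema.

```python
# A sketch for a Databricks notebook; the path, field names, and action names are
# assumptions about the delivered audit log schema.
audit = spark.read.json("s3://acme-audit-logs/audit-logs/")  # hypothetical delivery path

# Reactive analysis: who got unauthorized errors while trying to log in?
failed_logins = (
    audit.filter("serviceName = 'accounts' AND actionName = 'login'")
         .filter("response.statusCode = 401")
         .select("timestamp", "userIdentity.email", "sourceIPAddress")
)
failed_logins.show(truncate=False)

# Who is creating, starting, and deleting clusters, and how often?
cluster_actions = (
    audit.filter("serviceName = 'clusters' AND actionName IN ('create', 'start', 'delete')")
         .groupBy("userIdentity.email", "actionName")
         .count()
)
cluster_actions.orderBy("count", ascending=False).show(truncate=False)
```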
So all those different types of events, attached to each particular concept in the product, are available via audit logs, and as I mentioned, these are shippable to cloud storage for customers to do deep analysis on. The second capability comes via the SQL Analytics product; we call it SQL query history. It is available with the lakehouse virtual clusters and it deeply integrates with the managed catalog offering that we talked about earlier. At a high level, all the queries that users execute are captured and managed in the query history.
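Admins can also pull that history programmatically; below is a hedged sketch against the SQL query history REST endpoint, where the host, token, and the response field names are assumptions for illustration.

```python
# A hedged sketch: listing recent queries from the SQL query history API.
# Host, token, and response field names are assumptions.
import requests

HOST = "https://dbc-example-1234.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder admin token

resp = requests.get(
    f"{HOST}/api/2.0/sql/history/queries",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"max_results": 25},
)
resp.raise_for_status()
for q in resp.json().get("res", []):
    print(q.get("user_name"), q.get("status"), (q.get("query_text") or "")[:80])
```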
The query history is accessible to admins to see, “Okay, what tables are being accessed? What sort of filters are being applied to those tables?” If they decide that a particular user shouldn’t be allowed to access those tables, they can apply the right level of access control at the managed catalog level. So between query history and the low-latency audit logs, there is a full set of capabilities for providing data to users and admins to do the required analysis.

The last theme for overall governance, which legal and risk officers and security folks deeply care about, is regulatory compliance, especially in regulated industries like financial services and healthcare, as Tianyi alluded to earlier. They require certain stamps, globally recognized certifications that every platform or service they onboard should conform to.
So I’m going to talk about which certifications the Databricks Lakehouse Platform already has and what we are working on, to allow all of these different regulated industries to process their most sensitive data, their tier-zero data, on the platform. We have these stamps, and they cut across different industry verticals and cover the needs of everyone from small and medium-scale customers to super large organizations across regulated verticals: ISO 27018, SOC 2 Type II, HIPAA for the processing of PHI data by healthcare companies, HITRUST, again required by healthcare companies, and PCI DSS for companies processing payment and credit card data. The Databricks Lakehouse Platform is also a GDPR-ready platform, with built-in capabilities to conform to the GDPR requirements in the EU.
Those are the stamps that we already have. Plus, we are also part of Azure Government cloud: the Databricks Lakehouse Platform is available as Azure Databricks on Azure Government cloud, which is FedRAMP High certified. That platform is available for production-ready workloads at this point, so different federal and civilian organizations are using it today for the most important needs of the people of the United States, and those agencies can seamlessly onboard to the platform as part of the first-party offering on Azure Government cloud. So those were the different capabilities that we wanted to highlight across the six focus areas that we feel are the must-haves and should-haves in the platform security and governance landscape.
Now we can go into a few demos, where we want to demo certain capabilities per cloud, just to showcase how you can use these different capabilities across the multicloud product to make sure that the platform addresses your governance and security requirements. So the next thing I’m going to do is go through a few demos which showcase the capabilities that Tianyi and I have talked about. Out of the six themes we covered, we are going to cover as much as possible across three demos, and to make it interesting, we will show one demo per cloud, so you can see the overall set of capabilities across the multicloud product.
The first demo is about perimeter security and how I’m going to use the capabilities of bring-your-own-network, secure cluster connectivity, and Private Link on AWS. If you see here on my screen, I have this Databricks Lakehouse Platform workspace on AWS, and this is the URL through which I can access that workspace. I’m currently connected to a public network at this point, and I can reach out to Google and whatnot. Now, if I try to access this workspace using my credentials… sorry [inaudible], and if I try to sign in, you will see that it shows this particular error: the configured privacy settings only allow access to workspace [inaudible] ID over your current network; please contact your administrator for more information.
What that really means is that it’s an unauthorized error for this particular user. Any user that tries to connect to this workspace over a public network will get the same error, even if they’re allowed to access the workspace and the objects in it, because I have created this workspace as Private Link enabled, where I’ve made sure that there are VPC endpoints for both the front-end and backend interfaces of this workspace. And I can show you that I’m on the public network right now. I have this terminal open, so if I do a DNS lookup for the workspace URL, the same workspace really, you will see that the DNS chain ultimately maps to our load balancers [inaudible] for the Databricks control plane and a public IP.
And there is another DNS lookup that I’ve done here. If you see, that relates to the backend interface of the workspace: this is the relay to which the clusters in the workspace connect, pertaining to the secure cluster connectivity feature. This relay is hosted in the control plane, and again, in the US-East-1 region, this is that endpoint. And if I do a DNS lookup here, it also maps back to a load balancer [inaudible] and to a public IP.
So it’s not accessible over the public network; what can I do? In an organizational setup, ideally there is proper routing from the organizational network, or Direct Connect in the case of AWS, to a VPC endpoint in a transit or bastion VPC. The same thing can be done in Azure or GCP with ExpressRoute or Interconnect.
I do not have that particular setup, so what I’ve done is create a VM in my transit or bastion VPC, and through that VM I can show you that I can actually access this workspace. Let me open my remote desktop window here. Now if you see, this is basically a Linux VM on AWS, and this is the same workspace: I’m able to access it via this VM, which is created in the transit or bastion VPC from which the VPC endpoint for the front-end interface is reachable. If I show you the DNS lookup for that front-end interface in this case, now there is no network load balancer at our end, at least it’s not visible via that DNS lookup, and there’s no public IP; it maps to a VPC endpoint IP that I have provisioned in my transit or bastion VPC.
That’s how I’m able to access this workspace now, because all that traffic just goes over the AWS private backbone, so it’s secure and isolated for this particular workspace. And I can access the overall product: I can go into my notebooks, run any queries, and create clusters. I created this cluster yesterday and it is currently terminated, but basically it shows you that the whole workspace is accessible over the private network. So that’s the VPC endpoint in the transit or bastion network to access the front-end interface in this case. But what about the backend interface, the VPC endpoint for the backend interface?
If I go into this notebook, you’ll actually see that… let me [inaudible] this one. I have done a DNS lookup for the tunnel URL. The tunnel, the relay URL that’s hosted on the control plane that we saw over the public network, didn’t have Private Link in it. But when the customer sets up a VPC endpoint for the backend interface, we automatically make sure that it maps to a standard .privatelink. DNS name. And if you see, if I do a DNS lookup for this, it is again mapping to another VPC endpoint IP. Now, this VPC endpoint is created in my workspace subnets; it is not in the transit or bastion VPC. This is for the backend interface, through which the clusters for this workspace can connect to the relay and to the REST API in the control plane.
And again, if you see here, for the REST API that I’m talking about, if I do a DNS lookup, it is also mapping to another VPC endpoint IP in the workspace subnet. So what I wanted to showcase here is that I have deployed a workspace in my own private subnets in an AWS VPC. I’m able to connect to the Databricks workspace, the front-end interface, over VPC endpoints created in a transit or bastion VPC, and I’m able to provision clusters, have them access the relay, and submit commands and so on, because the backend interface is reachable over its VPC endpoint. The same capability is available across clouds. Now I’m going to shift gears and go from Databricks Private Link on AWS to Azure audit logs.
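For a quick programmatic version of the nslookup checks in this demo, a small script like the one below resolves the workspace hostname and reports whether it lands on private (RFC 1918) addresses; the hostname is a placeholder for your own workspace URL.

```python
# Resolve a workspace hostname and report whether each address is private or public.
# The hostname is a placeholder; point it at your own workspace URL.
import ipaddress
import socket

workspace_host = "dbc-example-1234.cloud.databricks.com"  # placeholder

addresses = {info[4][0] for info in socket.getaddrinfo(workspace_host, 443)}
for ip in sorted(addresses):
    kind = "private" if ipaddress.ip_address(ip).is_private else "public"
    print(f"{workspace_host} -> {ip} ({kind})")
```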
The second demo showcases how admins can configure diagnostic settings for audit logs on an Azure Databricks workspace and track who’s doing what in a workspace by sending all that data to an Azure Log Analytics workspace, which is a separate Azure first-party service, seamlessly integrated for customers as part of the Azure first-party experience.
Again, the same capability is available across our multicloud product. If you see here, this is the Azure portal; I’m logged into the portal because Azure Databricks is a first-party service, and this is a workspace that I have. If I launch this workspace, it opens up, and I have this Azure Databricks workspace with the same experience as you saw for AWS, the same UX look and feel.
Now, if you go to the diagnostics setting section, you see that there are these different events the admins can log to their storage account, or their event hubs, or their Azure Log Analytics workspace. These are the three different targets that are available via Azure Monitor, which is like the holistic single view, single pane of glass for all monitoring in Azure.
What I’ve done is already add a configuration for the Log Analytics workspace here, and if I quickly show you that setting, I’m sending the audit events for all of these different objects, for clusters, accounts, workspace, and notebooks, anything that’s happening within those objects, to this particular Log Analytics workspace. I’m not archiving to a storage account or streaming to an event hub, but you can definitely do so if you want to ship all of this data to a SIEM tool: you can stream to that SIEM tool via an event hub or use a storage account as a landing zone.
So let me go to that Log Analytics workspace. As you see, this is another first-party service that is seamlessly integrated with Azure Databricks via Azure Monitor. I go to the logs section, let me close this. This is the query view in a Log Analytics workspace, so I can actually run queries. There’s a particular query language, or DSL, called Kusto, so I can run Kusto queries, which look pretty much like SQL queries, to see what sort of data is landing. I can run filters, sorts, joins, et cetera, across different objects.
I have a couple of saved queries here, which I’ll just open up. All right, so as I showed you, this Log Analytics workspace is attached to my Azure Databricks workspace. Now, if I run this query… oops, not the last 24 hours, let me change it to the last 48 hours, because that’s when I generated some data. If you see here now, you’ll see a lot of these cluster actions for my two different identities, one my official identity and one my personal identity. Basically, I wanted to understand who has started, created, or edited a cluster in this workspace in the last 48 hours, so that would be the query to do so. And I can run different queries to get other types of insights from this Databricks clusters table.
And if I run this particular query, it’ll actually give you the distinct actions that are available. So start, delete, create, and resize, so all those different actions are available. So you can actually see, it’s not just about security, it’s also about, “Okay, what actions are actually being performed as part of the platform.” And those are being [inaudible] here.
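For teams that want to run this kind of Kusto query outside the portal, here is a hedged sketch using the azure-monitor-query SDK; the Log Analytics workspace ID is a placeholder, and the DatabricksClusters table and column names are assumptions about the diagnostic-log schema shown in the demo.

```python
# A hedged sketch: run a Kusto query against Log Analytics from Python.
# The workspace ID, table name, and column names are assumptions.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

kusto = """
DatabricksClusters
| where OperationName has "create" or OperationName has "delete" or OperationName has "start"
| project TimeGenerated, Identity, OperationName
| order by TimeGenerated desc
"""

result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=kusto,
    timespan=timedelta(hours=48),
)
for table in result.tables:
    for row in table.rows:
        print(list(row))
```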
The second type of analysis I want to show here is: I want to understand who has been logging in to this workspace, and whose requests are being rejected due to a bad token or whatnot. So again, let me change it to the last 48 hours, [inaudible]. And if you see, it shows a bunch of logins here across my official and personal identities. My personal identity is allowed because I’ve added it to the AAD tenant under the subscription as a guest user, and that’s how I added it to my workspace; it’s all seamlessly integrated with Azure Active Directory.
But in this case, as you see, my official identity has logged into the workspace via the browser, via the front-end interface, and I’ve also accessed this workspace via a personal access token, using the REST API. So as you can see, the accounts table, or the accounts event data, stores who has accessed the workspace using what interface, whether that’s via the UI, via a token, or some other mechanism. It also captures if a user is added or a user is converted to an admin. So I just wanted to do these two different sorts of analysis and show you all the different tables that are available once you configure audit logs or diagnostic settings for an Azure Databricks workspace and link it to a Log Analytics workspace in Azure, and the same capability is available across the multicloud product. All right, I’ll turn it over to Tianyi to show the third demo, which is about GCP, the most recent cloud provider that we have launched the platform on. Tianyi.

Tianyi Huang: All right, thank you, Abhinav. So next I’m going to demo the account console and the usage views. The console is where all the account-level administration happens; basically, it is the single pane of glass where you can manage all of your workspaces, identities, and SSO, and also, of course, the usage visibility and control, which I will focus on in this demo. As you can see, I’m currently at the login page for the GCP account console, and by the way, this account console is also available on multiple clouds. Here I already have SSO integration for this account with my Google identity, so you can see this blue Sign in with Google button that you are probably familiar with. I will then just log in using my Google identity.
Cool, so it is doing the auth handshake, and here we are at the homepage. As you can see, this is the list of workspaces I have under this account. I’m going to jump to the usage view, which is what I’d like to focus on. As you can see here on the top, we have a nice usage graph that shows the usage data across a period of time, and at the bottom we have a table showing more fine-grained breakdowns for different workspaces. In the graph you can slice and dice to see a couple of different things. First, you can change the y-axis of the graph to show either the dollar amount or the DBU amount. You can also check the usage by workspace, which will show you the top workspaces with the highest usage, and you can see the details if you hover your mouse over the graph.
Another common ask is, “I would like to know how much DBU I spent on interactive clusters versus automated job clusters,” so we can also show the usage by SKU, and then, of course, the total usage. The other thing you can do is check the usage for a certain period of time: you can select a fixed time period, or you can select whatever time range you want to see your usage for.
Another thing I want to call out here is that you can also download your usage data. It has all the detailed, fine-grained usage data. You can select the current month, the last three months, the last six months, or whatever range you want, and you can also choose to include the username; this is personal information that might also be useful for chargeback purposes. And with just one click from here, you get the data in a CSV file. So yeah, that’s also a common ask from our customers.
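Once that CSV is downloaded, a quick local analysis might look like the sketch below; the file name and the column names (sku, dbus, workspaceId) are assumptions about the export format.

```python
# A sketch of analyzing the downloaded usage CSV locally; file and column names
# are assumptions about the export format.
import pandas as pd

usage = pd.read_csv("usage-2021-05.csv")  # the file downloaded from the account console

# DBUs by SKU, e.g. interactive (all-purpose) versus automated job clusters.
print(usage.groupby("sku")["dbus"].sum().sort_values(ascending=False))

# DBUs by workspace, for a rough chargeback view.
print(usage.groupby("workspaceId")["dbus"].sum().sort_values(ascending=False).head(10))
```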
In terms of the detailed usage table down below, you can scroll down to see all the workspaces and search for the workspace whose detailed data you want to see. Similarly, you can select whether to show the dollar amount or the DBU amount, and you also have a time range selector here. So that is the usage view I wanted to show in the account console, and that’s what the demo is about.

Abhinav Garg: Really good demo, Tianyi. So you have seen how the six different themes that we talked about, the must-haves and should-haves for platform governance and security, manifest in the Databricks Lakehouse Platform as different features and capabilities that admins and security folks can employ. And this is a multicloud product, so all those capabilities are available across the three major public clouds that the platform is on.
We really appreciate you taking the time to attend this session. Please leave feedback on how we can improve the content, the delivery, the demos, et cetera, and please feel free to leave overall feedback on how the summit has been going for you. I also encourage you to check out our lineup of other sessions and attend those: if you are interested in security or governance, especially check out the managed catalog-related sessions, and there are many other related sessions by our partners and our customers as well. So with that, I’ll wrap up, and thank you very much.

Tianyi Huang: Yeah, thank you everyone, for attending. And we would like to have your feedback.

Abhinav Garg

Abhinav is a product leader specializing in Cloud, Data & AI, DevOps & Overall Enterprise Architecture. For the past three years, Abhinav has been helping Databricks customers meet their security and g...

Tianyi Huang

Tianyi is an Engineering Manager on the Databricks Enterprise Platform team. He is leading teams tha