Introducing NEPHOS: Implement Lakehouse Without the Infrastructure Management

May 27, 2021 04:25 PM (PT)

Enter the next phase of democratized analytics and AI to increase scale and agility and reduce time to innovation. With the introduction of NEPHOS, Databricks is improving how organizations engage with data and analytics platforms by placing greater control in the hands of data teams. By helping data teams get to work faster while giving infosec and platform admins confidence that governance and security policies are enforced, organizations can securely accelerate innovation. NEPHOS unleashes data teams to get to work faster and smarter with instant compute, an optimal price/performance ratio, and built-in security and compliance features. In this session, learn how the newly announced automation enables workspaces to be ready to run notebooks or execute SQL queries instantly, without the labor-intensive, manual tasks of setting up infrastructure.

In this session watch:
Aaron Davidson, Software Engineer, Databricks
Vinay Wagh, Director of Product, Databricks


Transcript

Vinay Wagh: Hi. My name is Vinay Wagh. I’m a Director of Product here at Databricks. I’ve been with the company for almost three years now. My focus at the company is on all things platform, infrastructure and security. And today, I have with me Aaron Davidson, one of the Senior Staff Engineers at Databricks, who is leading what we call Project Nephos.
Project Nephos is a project within Databricks that’s building a new version of our compute platform, and today I’m really excited to talk to you not only about the genesis of what brought us here, but also about what the immediate and longer-term future looks like. So, let’s get started.
If the last decade has shown us anything, it’s that companies that are able to amass large amounts of data and analyze it are able to build new products, customer experiences and business efficiencies that have allowed them to dominate their market segment. We believe that the companies that are going to dominate the next 10 years in every market segment are going to be analytical data companies. These are going to be companies that are building data products.
So, what are data products? Well, a credit card is a data product, because it relies on collecting and analyzing thousands and thousands of credit card transactions to detect possible fraud. Companies that do this well are able to offer better credit card rates and save a lot of money on fraud. A pharmaceutical drug is a data product, because it relies on collecting a lot of patient health data, as well as drug simulations, in order to build new drugs and release them to market sooner.
In fact, self-driving cars are a data product, because they rely on collecting large amounts of map data, along with a lot of sensor and image data, to essentially steer you. The question is, if data is so important to your business, are companies actually doing this? Well, we have some data about the data. A recent McKinsey analytics report analyzed a bunch of companies at different stages of building out their data architectures, from early stage, to those that have invested in modern data architectures, to those operating modern data architectures at scale. And what that study found was that the high-performing companies, the ones performing at the top, are the ones most likely to have modern data architectures that support analytics at scale. The word scale here is actually pretty important, because that’s when you arrive at the benefit of the investments in data.
These companies who are performing well have realized that innovation happens when you enable all of your users with all of the data they need in order to build products, efficiencies and new customer experiences. It’s not about a single team within your company. It is about having all your machine learning engineers, data scientists and analysts be able to query the data they need in order to do their work.
But bridging this gap between all of the users in your company and all of the data they need has been a challenge, and in fact still is. A recent MicroStrategy report asked employees of these companies how they go about making a data-driven decision today. 79% of these employees said that in order to make a data-driven decision, they have to reach out to IT, or maybe a business analyst within their company, for help. Only 7% of the employees had access to a self-service tool they could use to help themselves. And the rest were actually winging it.
And so what is the reason for that, and how do you go from being that company to being one of the high-performing ones? And we believe the lakehouse architecture is the answer. It allows you to bring together data lakes and data warehouses under one umbrella, allowing you to process and analyze large amounts of structured, semi-structured and unstructured data within the same interface, be able to do machine learning, analytics, and AI all under the same data umbrella.
And Databricks and our data analytics platform provide that. It bridges that gap. It sits between all your users and all the data. Now that being said, going from a small team that’s building a little project, let’s say a recommendation engine, to operationalizing data for everybody in your company are two different challenges. And when you go from a small team to all your users, you’re going to see a lot of different patterns, like the amount of compute infrastructure you use and the number of environments you have to operate. All of those things increase the complexity. And that’s just the challenge that comes with scale.
If you think about this from the perspective of what users need in order to actually run their queries, the challenges come in three areas when we look at a platform. The first one is elastic, instant compute: most of these users need to be able to run their queries right away. They don’t want to wait. Let’s say you have thousands of engineers that you want to enable. Giving them access to instant compute is a challenging problem, because everybody needs to be able to get compute when they want it.
The second aspect is security and compliance. We keep talking about all users, all data, and that can sound scary to people on the infosec and governance teams, because that data is really valuable and probably contains a lot of confidential information, so you want to protect it. You want to make sure that security and compliance are built into the platform before you enable all these users.
And the third is performance and cost. You could go ahead and let every one of your employees do whatever they wanted with any tool, but that comes at a cost, and it’s also going to be inefficient. So when you look at this challenge realistically, at enabling all these users to do what they want, you have to tackle these three problems, and you have to figure out how you’re going to operate in that environment.
Let’s take each of these one by one. Let’s talk about elastic, instant compute. We’ve been talking to customers for a while now, and launching an instance with the cloud providers takes minutes, and that can be conservative or optimistic depending on which cloud provider you’re talking about. Maybe it’s three, maybe five, and in some cases it can take up to 10, depending on what type of instance you’re trying to launch and how many of them. And the engineers who want to use that data don’t want to wait for these clusters to boot up, because of how they work. Imagine if you wanted to use Google Docs and somebody said, “Hey, to start editing this document, you’ve got to wait five minutes for the compute to boot up.” It sounds ridiculous, but that’s the situation these developers are in. They sit at their desk and start typing code. They don’t want to wait.
The other thing, the other data point here that’s worth considering, is that most of the jobs, if you look at 80% of the jobs that run, they run for about five minutes or less. And if you contrast that with the amount of time it takes to start an instance, you’re going to wait five minutes to run something for five minutes. It just seems like it’s a long time, basically.
And the third is that in a lot of cases, in a lot of our conversations, data teams have been tasked with providing an SLA to their end users. Let’s say, for example, you create your gold tables, and they’re meant to be queried by analysts to produce business reports and the like. In a lot of cases, data teams have to guarantee that, hey, your query is going to run in under one minute. And that is challenging in this environment, because if you decide that you want to launch a cluster when somebody wants to run a query, of course that’s going to take a long time. You’ve already violated your SLA.
But even if you decide to rely on autoscaling and say, “You know what? I don’t want to keep a full cluster running, I’ll autoscale, so I don’t have to launch all the instances when queries run,” even that doesn’t work, because autoscaling is also launching instances. If it has to wait three or five minutes to launch another instance, it is going to slow down the query and it is not going to properly meet the SLA.
And so if you look at all these problems combined, getting instant compute is something that requires investment and is also challenging, unless you just run compute all the time. Some of the customers we talked to who encountered these problems got around them by simply not shutting off their clusters. They said, “We’ll just keep these things running, and that’s it,” so you don’t have to worry about it. But then you’re provisioning for essentially the worst case all the time, at least for working hours. So, that becomes inefficient.
Now, let’s look at the second problem, which is security and compliance. And in this scenario, I think it’s important to realize that the way teams are structured today in these large companies, each team is given their own cloud environment. So, hypothetically, if we talk about AWS, I get my own VPC and AWS account. My billing happens based on what I use in that account. And so each team has to launch infrastructure in order to process data. For data analytics, AI and other sorts of applications, the VMs that are going to process the data are running in your environment, in your account.
Now, what are the challenges of that? Well, once you have an environment, you have to configure the firewall rules to make sure the Internet isn’t open to it. You have to prevent data exfiltration, making sure these clusters are only allowed to access the things in the environment they’re supposed to access. And that requires a fair amount of configuration and operationalization.
The other thing is compliance and monitoring. There are VMs running there; it’s an environment in which you’re running compute that possibly processes sensitive data. Many of the companies we talk to are also regulated, so at some point they have to provide audits of what’s going on and detect any kind of malicious activity in this environment. That requires investment, because now you’re investing in tools, you’re investing in agents, you’re investing in products that are going to help you analyze this. Granted, these are probably centralized in a company, but every environment that you spin up adds to it. It adds to the cost and it adds to the operations.
The other part of it is just the infrastructure operations associated with it. Having a VNet or a VPC or a project or whatever is fine, but then you have to configure it correctly, manage drift, and make sure you’ve automated all the infrastructure deployments through CloudFormation, Terraform, or whatever you might be using. All of that is something you have to do before you can start launching clusters and using your data.
Those are the challenges. Now, you take that sort of approach and then decide that you want to scale it across the company, across all your teams. In many cases, there are hundreds, in some cases many hundreds, of different AWS accounts and teams that have to get access to data. This means that you now have to start managing all of that infrastructure, the environments, the monitoring [inaudible], all of these kinds of things. It’s something you have to master before you can scale toward that goal of having all users get access to the data they need. It requires scaling and operations in order to do that.
Now, let’s look at performance and cost management [inaudible]. If we take that environment where a single team is operating, let’s just zoom in on one thing. Say you take a team that’s building a recommendation engine in some project they’ve decided to do; they have a pattern of launching clusters. There’s a team that comes in every day at 9:00 AM and possibly leaves at 7:00 PM, whatever the timing is. They launch a few clusters, they’re using some number of machines, their usage goes down toward the night, and then they come back the next morning. Or they have jobs that run periodically, maybe daily, hourly, weekly, whatever it is. They have these cluster and infrastructure needs, and that’s their level of activity.
Now, somebody who is managing the platform, or a lead on a data team, has to track usage in order to report, “Hey, you’re using so many VMs,” or, “You’re spending this kind of money.” That’s the kind of reporting that has to be put in place in order to monitor what’s going on. And then you have to monitor utilization. What if people are barely using these machines? If I launch a 100-node cluster and barely even use it, how will you even know about it? So data teams, in order to be efficient, have to build in some kind of utilization monitoring if they want to be able to save cost. We’ve spoken to a lot of customers who say, “Well, data is the priority. We’re not looking at that right now. Our focus is on enabling people.” Fair enough, but when you get to a certain scale, you’re going to care about it. And so building in this utilization monitoring is key to being able to optimize cost.
And then, based on what you’re seeing, based on the data you have, you have to configure the right cluster policies, instance policies and things like that, so that you can, over time, really know the cost. Because one thing you have to understand in this environment is that when these teams build their products, they themselves do not know exactly what their compute needs are going to be. If you write some application code and I ask you, “How many cores do you want?”, in all likelihood you’re not going to know, because you cannot map code to a number of cores. So you start out not knowing how much compute you need and converge on the right amount over time. And that requires tooling, tooling you have to build in order to do it.
And so it’s something that you have to do. And then when you decide to scale it across all your teams, imagine all these AWS accounts or Azure accounts or whatever, all of them running these VMs. You have to build in this tooling and then also be able to optimize, right? Not just get visibility, but tell these teams, “Hey, maybe you should use a different instance type,” or, “Maybe consolidate your clusters. You’re spending too much money.” All of this is tooling that has to be built.
If you look at it from this perspective, these challenges are what brought us to Project Nephos. Project Nephos is a new version of the platform that we’ve started work on, and we will have an early version of it this year. It’s aiming to address these issues. It’s basically a compute platform with certain flexibilities and certain optimizations built in, so that you don’t have to worry about them.
Let’s talk about it at a really high level first before we dig in. The architecture for Project Nephos is, essentially, a change from what we’ve been doing: we’re now going to run the cluster compute in our environment, in our account, and manage it on behalf of our customers. Aaron is going to go into the architecture of how we isolate these things and go deeper into how we run them, but at a high level, you’re going to have workspaces within Databricks, and no matter how many you spin up, they’re going to have clusters that we take care of running and isolating for you. And we would be responsible for the compliance, security and auditing of this environment, making sure that you don’t have to worry about the tools and monitoring and auditing and all of that. That would be built in.
We’re also going to take care of helping you optimize cost for these running VMs, and the performance/cost trade-offs, as well as giving you instant compute. I’ll go a little deeper into that. But the idea is that in order for us to operate and allow you to scale, we need to be able to control this and manage it programmatically as opposed to manually, because it’s not something you can do manually.
All right. Let’s take a look at how we achieve instant compute in Nephos. Let’s say we have a whole bunch of users who have created workspaces and are working in their own environments. They’re running queries and jobs, and that requires launching clusters, and those clusters are now getting launched within our environment, in a VPC that’s managed by Databricks in the Databricks cloud account. And as these VMs are launched, they get a cross-account identity that allows them to access the data in your environment in order to process and analyze it. This is something you would have provided when you created the workspace, which is something you do today: you grant us the ability to assume it in order to access that data.
But once that happens and these clusters are running and processing queries, we learn. We start learning about how many VMs are getting used across different workspaces, or even across an entire region. And then we use what we’re learning to predict, at any given point in the day, in any given 10-minute window, how many VMs we are going to need.
And then we use that to pre-launch VMs and keep them ready in a pool. Those VMs are just sitting there, waiting to be assigned to a cluster. And as users launch more clusters, or terminate them, we basically allocate them free instances from this pool. Now, we don’t return used instances to the pool; we always give instances back to the cloud provider whenever we’re done with them. But there is a set of pre-launched VMs just sitting there, so that when you do new cluster launches, you get them right away. This is what allows us to give you clusters in seconds as opposed to minutes, which is a game changer. And I’ll explain why, and the implications of this that may not be obvious.
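To make the mechanics concrete, here is a minimal sketch of that pool-allocation idea in Python. The cloud-provider client and its launch_vm/terminate_vm methods are hypothetical placeholders; the real Nephos internals are not public.

```python
from collections import deque

class WarmPool:
    """Toy model of the pre-launched VM pool described above: VMs are launched
    ahead of demand, handed out instantly, and never returned to the pool."""

    def __init__(self, cloud, target_size):
        self.cloud = cloud              # hypothetical cloud-provider client
        self.target_size = target_size  # predicted demand for this time window
        self.idle = deque()

    def replenish(self):
        # Pre-launch VMs so the pool stays at the predicted target size.
        while len(self.idle) < self.target_size:
            self.idle.append(self.cloud.launch_vm())

    def acquire(self):
        # Hand out a warm VM in seconds instead of waiting minutes for a cold boot.
        if self.idle:
            return self.idle.popleft()
        return self.cloud.launch_vm()   # pool exhausted: fall back to a cold launch

    def release(self, vm):
        # Used VMs go back to the cloud provider, never back to the shared pool,
        # so a VM is never reused across customers.
        self.cloud.terminate_vm(vm)
```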
Now, once we have a system where we are in control of launching, isolating and monitoring these things, there is a set of things we can do here. In the near term, we can scale the clusters based on usage. We can figure out, okay, this cluster has more users and more queries being run, so we ought to scale it more aggressively, or less aggressively. Basically, we can be more flexible about that than we are today. And we can manage the spot fleets and policies on behalf of the customers.
One thing we can do is make sure that if reducing or managing cost is a priority, the worker nodes for these clusters are purely spot-based. What should the spot percentage be so that you never slow down the workload substantially, while at the same time saving money with the right spot mix to make it cost effective? Those kinds of things, along with governing policies like our existing cluster policies, we can configure on your behalf.
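For illustration, the spot mix Vinay describes maps roughly onto the aws_attributes block of today's Databricks Clusters API; the specific values below are placeholders, and the point of Nephos is that these knobs would be chosen and tuned on your behalf.

```python
# Illustrative cluster spec; the spot settings are the knobs Nephos would manage for you.
cluster_spec = {
    "cluster_name": "etl-workers",
    "spark_version": "8.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
    "aws_attributes": {
        "first_on_demand": 1,                  # keep the driver on on-demand capacity
        "availability": "SPOT_WITH_FALLBACK",  # workers on spot, fall back if reclaimed
        "spot_bid_price_percent": 100,         # the "right spot percentage" trade-off
    },
}
```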
Then, we can also autoterminate frequently to save cost. This would not be possible if we weren’t able to build a system like this, because today our customers do not autoterminate frequently enough. And the reason they don’t autoterminate is the first reason we discussed: users do not like to wait for machines. Imagine if I autoterminated every 30 minutes because of inactivity. If any of these engineers or scientists went into a meeting and came back, their clusters wouldn’t be up. They’d have to relaunch them. What we figured out from our conversations is that 90 minutes seems to be an acceptable timeout for non-critical clusters: it accounts for somebody going to a meeting or taking a short break and coming back to their desk, and allows them to continue working without having to relaunch clusters.
But with this architecture, you can just terminate them whenever you want. You can say, “I want a 10-minute autoterminate,” because when it launches up, you’re getting it in seconds. It’s not like you have to wait a long time. This changes it because now if you autoterminate frequently, you’re saving money.
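Extending the illustrative cluster_spec sketched above, the idle timeout is a single field in today's cluster specification; the aggressive value below is only viable once a relaunch takes seconds rather than minutes.

```python
# 10 minutes instead of the ~90 minutes teams settle on today.
cluster_spec["autotermination_minutes"] = 10
```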
And in the future, we’re going to focus on automatically adjusting instance and cluster sizes. It’s somewhat aspirational, but if you’re using a large cluster, we can recommend and say, “You know what? You don’t really need a large cluster for this, because it’s not being fully used. Maybe you should use something smaller.” Those kinds of things are possible, as well as a future where we could schedule jobs optimally in order to reduce cost.
A simple example of this: if we took a 10-node cluster and decided to run these five jobs on it, and that increased utilization without compromising on security, since they belong to the same customer, the same team or whatever, we could actually reduce cost and increase efficiency by doing that. All it takes is a little flexibility on when each job completes. As long as that exists, we can make that decision and say, “You know what? Batch these together. Run them at the same time. We’ve now saved time and money.” Those are the kinds of things we can do with this platform.
In a sense, it’s not just about the security we spoke about, instant compute, and performance and cost savings. Nephos is about simplifying. And to talk more about how we simplify our product and go deeper into what our networking, isolation and other things look like, I’m going to hand it off to Aaron. Aaron, you want to take it?

Aaron Davidson: Thanks, Vinay. As Vinay said, Nephos is about simplifying the Databricks product and compute management. Let me talk about how. First, I want to talk about simplifying the notebook experience. If you use Databricks, you’ll know that before you can get started with a notebook, you need to provision a cluster. And that can actually be somewhat of an ordeal by itself because you need to decide a whole host of configuration options. For example, instance type, the number of nodes, the Spark version, whether you want spot or on-demand. And oftentimes, a different team will actually own the cluster configuration and the management of clusters. And so you might have to talk to that team just to get started.
One way that Nephos can help simplify this: as we move the compute into the Databricks account, into our management, we can actually make it super easy for users to get started and just run a command in a notebook, and by default launch a small cluster for that individual user, for that individual notebook session. And if you later want to work with more data, or if you want a finely-tuned cluster, you can still attach to an existing cluster. But in general, the goal is to make the default experience really simple, really streamlined. And we can do that using Nephos.
Another example that Vinay also mentioned is around simplifying our SQL analytics product. Databricks SQL is a first-class product in Databricks where people go to create dashboards, ask questions, run queries and generate reports, that kind of stuff. SQL is a very interactive thing: people will come, do a session, ask a few questions or generate their report, and then go away having done that job, and maybe come back several minutes later, maybe hours later, maybe a whole day later.
Today, it’s very difficult to manage the lifespan of these clusters, because if you want to save cost, you’ve got to turn off those clusters immediately. But then when you come back, you may have to wait several minutes for those clusters to be respawned, because we actually give the VMs back to the cloud provider since they’re running in the customer account. In their account, when you want to turn off a cluster, you have to hand those VMs back, or else you’re paying for them. So to recreate that cluster, you have to relaunch those VMs from the cloud provider, reprovision Spark and connect it all up, and that’s what takes time. That’s what takes a few minutes.
On the other hand, if you want to provide a good user experience, then what you’re going to have to do is either keep a cluster running all the time, or provision a warm pool that lets you share those VMs across whatever clusters need them. But in either case, you’re actually paying for those VMs even when no one’s using them. And so you have to make this trade-off: either you have a worse user experience, or you pay more money.
As we move the compute into our account, we can actually take this trade-off away from the user by giving you both. Instant compute: going from no compute to running, we can do that within seconds. But also no cost overhead, because you’re not paying for those VMs while you’re not using them. And the way we can do that is by having a pool of shared, warm VMs in our account and amortizing that across all customers. It really transforms the way users use and manage the SQL experience: they can come and go as they please without paying for the VMs all the time. It fundamentally shifts the paradigm, so that’s really exciting.
But as we move compute into our account, we can actually break out of the bubble of hosting Spark as our main way to access large amounts of data and build data products. Today, Databricks launches Spark clusters in the end user’s account. And Spark is great for batch processing, for stream processing, for SQL, all sorts of things. But it’s not well suited to some data use cases, such as model serving. It’s very hard to build a model serving system on top of Spark. And so by moving compute into our account, we can abstract away the kinds of compute that underlie Databricks, and we can actually build new types of products.
One example here is model serving. For those not familiar with model serving, the idea is you train a model. Let’s say it’s a recommendation model, a Netflix-style recommendation model, that takes a set of features about a particular user and produces a set of recommended shows. Many times, you’ll apply this model in a batch fashion, in an offline fashion. Let’s say, take all of yesterday’s users, predict the shows they want to watch, and then use that for tomorrow. Or you could do streaming, where maybe every hour you incrementally update it. But many use cases require one thing, real-time inference, meaning that as users actually make queries on the website, they need to hit the model and get the recommendations out. And this is where, at Databricks and in general, there’s a gap in actually being able to operationalize and productionize machine learning models.
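As a sketch of what real-time inference looks like from the application's side, here is a hypothetical request against a served recommendation model. The endpoint URL, token, and feature names are placeholders, and the payload follows MLflow's pandas-split scoring convention; it is not a statement of what the Nephos serving API will look like.

```python
import requests

ENDPOINT = "https://<databricks-instance>/model/recommender/1/invocations"  # placeholder
TOKEN = "<personal-access-token>"                                            # placeholder

# Features for one user; the website is blocked on this call, so latency matters.
payload = {
    "columns": ["user_id", "hour_of_day", "last_genre"],
    "data": [[12345, 20, "documentary"]],
}

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=5,
)
recommended_shows = resp.json()
```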
In particular, it often requires an extra team in the middle. A data scientist will produce the model, it will be great, and then they have to hand it off to another team to rebuild that model and operationalize it. At Databricks, we’d like to remove that step, to allow the data scientists to directly productionize it, directly do the A/B testing, without the need for that intermediate team.
And we can do this using Nephos because we can abstract away the compute. We can start running model serving as well as clusters in our account, and we can manage the availability and scalability of that offering.
Overall, Nephos is about simplifying data science and machine learning. The first hard part of any data science and machine learning platform is actually scalability, availability and security. These are very hard things to build, but at Databricks, this has become core to our product. We’ve gotten really good at scalability, availability and security, and we’re always improving. And so one awesome thing about Nephos is that it gives us the ability to apply this expertise to new domains such as model serving; there are many other data products out there that Databricks can help build, and will build, on top of the Nephos architecture. Model serving is just the first that will go to market. It also provides fundamentally faster compute spin-up, since we can keep these warm pools [inaudible] across multiple customers, and that really helps both the notebook and the SQL products.
The first use case that you’ll see production ready, ready to use, is that SQL product. Later this year, you’ll be able to get instant compute, so a cluster start time of less than 15 seconds is our first goal. We can always reduce that over time, but that’s close enough to interactive. We want to be able to automatically tune the price/performance by shutting off that cluster 10 minutes after the last query finishes.
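For reference, the behavior Aaron describes corresponds to fields that exist in the Databricks SQL endpoints API today; the values below are illustrative only.

```python
# Illustrative SQL endpoint definition: a small warehouse that stops itself
# 10 minutes after the last query, because restart is near-instant on Nephos.
sql_endpoint_spec = {
    "name": "bi-dashboards",
    "cluster_size": "Small",
    "auto_stop_mins": 10,
}
```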
And we want to keep it simple and secure without any operational overhead. We’ll manage the compliance. We’ll keep the simplified access controls, so we’ll propagate user identity through our clusters back to the cloud storage, back to the catalog layers. That’s very important. And we’ll have our standard features of SSD encryption, DBFS, instance profiles, and of course strong isolation between customers. This is the first one [inaudible].
Now, we’ve talked about some of the use cases of Nephos and what’s coming soon. I want to spend some time digging into the architecture as well, so that you can understand how we build this product, and also how it provides the security that’s critical for every use case that runs on top of it. So let me break down this diagram for you, which explains the architecture. On the left-hand side, you see the control plane. These are the Databricks control plane services, such as our web application, our jobs engine, our cluster manager. This is the same in every architecture: whether the data plane is running in the customer account or in the Databricks account, that control plane looks the same. Instead, let’s focus on the data plane, or the compute layer. This is where the clusters run. This is where the compute runs.
First, we have the system services in the top left. Those are just isolated pods that manage the rest of the data plane. Below that are the unallocated, unassigned clusters. And this is very interesting. What this means is an entire Spark cluster, ready to go, already provisioned, and just not yet assigned to a customer. We mentioned wanting to get a startup time below 15 seconds. Well, as you may know, Spark is actually a pretty heavyweight process. It’s JVM based, so you have cold-start compilation times. It has a Hive metastore to connect to, it has to connect to the executors and establish all those connections. Even starting Spark in 15 seconds is hard, let alone any other rejigging we have to do in the system.
So, what we decided to do was actually pre-provision Spark clusters, have them ready to go and unallocated, and only late-bind them, assigning them to customers as needed. Once a customer does need a Spark cluster, we just pop one over and put it into that customer’s domain. And from there, we have two layers of security isolation. The first is the container boundary: all compute is containerized using unprivileged containers, using AppArmor, seccomp, and every mechanism of protection we have available to us. So, it’s a relatively secure boundary, but we don’t want to rely only on the container boundary for security, because this is running untrusted customer code.
And so the second boundary is the VM boundary. And one key point here is that VMs are not reused between customers. Once a VM is assigned to a customer, either a driver or a worker VM, and that customer is no longer using it, we actually give it back to the cloud provider. We terminate the VM and recycle it through the cloud provider. We’ll allocate a new VM from the cloud provider, but essentially it’s all going through that hypervisor layer. So, you never have a VM that was used by one customer later being used by another customer. That’s VM isolation.
In addition, you can see these red boxes, which are the network isolation. A VM running for customer A can only communicate with other VMs also running for customer A, and cannot establish connections to a VM in the unallocated pool, or a VM in the customer B pool. As a result, even if you could break out of a container, VMs are completely isolated from one another, providing strong isolation between our customers.
Now you might understand how we do our isolation story, how we do our compute management. But equally important is how we do data access. Databricks is often used to access data where it’s at. This makes it distinct from some other products, such as data warehousing solutions, where you ingest relatively clean data into the data warehouse and it manages it from there. Databricks is actually often used to do that initial ingestion, and so wherever your data is at, whether it’s in blob storage, a database within a VPC, an on-prem HDFS cluster, a Redshift cluster, whatever, Databricks and Spark are excellent at scooping up that data, joining it together and producing those intermediate tables to be ingested into the data warehouse. And so this makes data access a harder problem. It often requires IAM controls or network connectivity to those resources. When the data plane is running in the customer’s account, they can peer it together. They can establish a peering connection. They can give it an IAM role. And they can make the connection happen to whatever private resource you need.
But as we move that to our account, these problems become harder. IAM is actually similar, because the data plane will be able to assume customer-provided roles. Now, we have to be really careful that a particular VM, a VM for customer X, only has credentials associated with customer X’s role, and we still have to maintain that VM isolation, but otherwise we can do the same kind of role-based access.
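A minimal sketch of that cross-account pattern with boto3; the role ARN and external ID are placeholders for what a customer supplies when creating the workspace, and the real credential plumbing inside Nephos is more involved.

```python
import boto3

sts = boto3.client("sts")

# A customer X VM assumes only customer X's role; it never holds another customer's credentials.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::<customer-account-id>:role/databricks-data-access",  # placeholder
    RoleSessionName="nephos-cluster-session",
    ExternalId="<databricks-supplied-external-id>",                            # placeholder
)["Credentials"]

# Use the temporary credentials to read the customer's data in place.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```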
Networking controls, though, are where it gets a little harder, but it’s still possible. So, let me break this down. Most traffic will go to S3 or blob storage, whether that’s Azure Blob Storage or Google’s blob storage, and doesn’t require this kind of control. But when you need to access important, private data, when you need to ETL it in, it will still be possible in this network architecture. That’s what I’m showing here. And it does depend on the cloud providers and their mechanisms. In this case, I’m giving an example of [inaudible] private routing mechanism.
So, customer X has a database, an RDS instance, running in their private VPC. Today, as I said, if you’re running the data plane in your account, you could just peer that with the Databricks VPC and establish a direct connection. But that doesn’t work when the data plane in our account is multi-tenant, because you could have two different customers with overlapping CIDR ranges, and we don’t control the customers’ CIDR ranges. Today, customers can manage this. It’s actually pretty challenging when you’re running a large enterprise to make all the CIDRs work out so that you can peer everything together that needs to be peered, but customers can and do do this today.
To give a picture of how this works, we have a customer VPC, and at the bottom we have the Databricks data plane VPC. And it’s kind of small, but you can see that there are customer X VMs and customer Y VMs running in the same VPC, and that’s what makes this challenging. Now, the first step is to create what’s called a VPC endpoint service in the customer X VPC. A VPC endpoint service will allow us to expose this RDS into a Databricks VPC. To back a VPC endpoint service, for those not familiar, you have to create a network load balancer, a layer 4 NLB. And to back that, you have to have a set of EC2 instances.
Ultimately, what we have to do is create some EC2 instances which proxy over to the RDS; traffic is proxied through those EC2 instances, and the VPC endpoint service is associated with the NLB. That’s relatively complicated. Hopefully, Amazon will make it easier in the future to just say, “I want to expose this RDS directly as a VPC endpoint service.” But today, we need this extra layer of indirection.
As I said, this is only for very important data that you need to access from the data plane, and it may require help from another team today. But hopefully, this will become easier in the future as Amazon expands the capabilities of their VPC endpoint services.
But today, that’s what the customer does. They set up the VPC endpoint service and we take it from there. What we do is introduce a new VPC, and you’ll see why: we have this idea of a router VPC. We bind that VPC endpoint service into the router VPC. What that means is that in the Databricks router VPC there is now a set of private IPs, private interfaces, and if you access those private IPs, traffic will actually route to the NLB, through the [HAProxy], and ultimately to the RDS itself. And those private IPs come from the router VPC’s CIDR range, meaning that it doesn’t matter what CIDR range the customer has chosen. Now that we have the VPC endpoint in this router VPC, we can peer the router VPC to the data plane VPC, and the data plane running in our account on customer X VMs can access the VPC endpoint through the peering. The VPC endpoint can reach the RDS through the endpoint service, so we have end-to-end connectivity.
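A rough boto3 sketch of the two halves of that wiring, assuming the NLB fronting the RDS proxy instances already exists; all ARNs and IDs below are placeholders, and this is an illustration of the AWS primitives involved, not Databricks' actual provisioning code.

```python
import boto3

ec2 = boto3.client("ec2")

# Customer side: expose the NLB (which fronts the proxy instances for the RDS)
# as a VPC endpoint service.
service = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=[
        "arn:aws:elasticloadbalancing:<region>:<account>:loadbalancer/net/rds-proxy/<id>"
    ],
    AcceptanceRequired=True,
)["ServiceConfiguration"]

# Databricks side: bind that service into the router VPC as an interface endpoint,
# so it gets private IPs drawn from the router VPC's own CIDR range.
endpoint = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    ServiceName=service["ServiceName"],
    VpcId="vpc-0router0000000000",                  # router VPC (placeholder)
    SubnetIds=["subnet-0router0000000000"],         # placeholder
    SecurityGroupIds=["sg-0custxendpoint000000"],   # endpoint security group (placeholder)
)
```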
Now, equally important to allowing that connectivity is not allowing connectivity from the customer Y VMs. And the way we do that, you can see the other boxes, is the security groups. The customer X VPC endpoint has a security group associated with it. All customer X VMs have a security group associated with them, and there’s a whitelist so that only customer X VMs are allowed to establish connections to the VPC endpoint for customer X. Customer Y VMs are unable to establish that connection, unable to use that endpoint. Therefore, we have secure routing into private networks, allowing you to ETL in that data as necessary.
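Continuing the sketch above, the whitelist itself is an ordinary security group rule: only customer X's VM security group is allowed to reach customer X's endpoint. The group IDs and port are placeholders.

```python
# Customer Y's security group is simply never added here, so its VMs cannot connect.
ec2.authorize_security_group_ingress(
    GroupId="sg-0custxendpoint000000",      # the endpoint's security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,                   # e.g. a Postgres-compatible RDS
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-0custxvms0000000000"}],  # customer X VMs (placeholder)
    }],
)
```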
And you can see here that this expands horizontally for multiple customers, and it also expands horizontally for multiple data planes, enabling us to move customers around as they grow across multiple VPCs. We’re not constrained to a single VPC. That’s why we have that router as a layer of indirection across multiple data plane VPCs.
Okay. This showcases how Databricks, in the Nephos architecture, can provide both IAM controls and network connectivity where it’s important. Of course, that setup on your side is not trivial today, but hopefully, again, that will be easier in the future.
To summarize with a roadmap: in 2021, expect to see the SQL analytics use case on Nephos on both Azure and AWS. The value is instant compute. In 2022, and maybe somewhat beyond, expect to see not only SQL analytics but also the workspace, so that’s notebooks, jobs, and arbitrary computation running in our account, as well as that model serving product I talked about, and other products down the line. These are the three that are already underway. And if you’re interested in our private preview, please talk to your Databricks representative and we’ll talk about getting that started. Thank you.

Aaron Davidson

Aaron Davidson is an Apache Spark committer and software engineer at Databricks. His Spark contributions include standalone master fault tolerance, shuffle file consolidation, Netty-based block transf...