As the amount of data in an organization grows, more and more engineers, analysts and data scientists need to analyze this data using tools like Apache Spark. Today, IT teams constantly struggle to allocate big data infrastructure, divide budget among different users, and optimize performance. End users such as data scientists and analysts also spend enormous amounts of time tuning their big data infrastructure for optimum performance, even though this is neither their core expertise nor their primary goal, which is deriving insights from data.
To remove these operational complexities for users, the next generation of cloud computing is headed toward serverless computing. Products like BigQuery offer serverless interfaces that require zero infrastructure management for users. However, these existing products address only simple, stateless SQL use cases.
Today, we are excited to announce Databricks Serverless, a new initiative to offer serverless computing for complex data science and Apache Spark workloads. Databricks Serverless is the first product to offer a serverless API for Apache Spark, greatly simplifying and unifying data science and big data workloads for both end-users and DevOps.
Specifically, in Databricks Serverless, we set out to achieve the following goals:
At Data + AI Summit today, we have launched our first phase of Databricks Serverless, called Serverless Pools, which allow customers to run a pool for serverless workloads in their own AWS account. Hundreds of users can share a pool, while DevOps can control the resource cost of their whole workload in a single place. In future phases, we will also provide services to run serverless workloads outside the customer's AWS environment.
Databricks Serverless pools are automatically managed pools of cloud resources that are auto-configured and auto-scaled for interactive Spark workloads. Administrators only need to provide the minimum and maximum number of instances they want in their pool, for the purpose of budget control. End-users then program their workloads using Spark APIs in SQL or Python, and Databricks will automatically and efficiently run these workloads.
The three key benefits of serverless pools are:
There are multiple existing resource managers for Apache Spark, but none of them provides the high concurrency and automatic elasticity of serverless pools. Existing cluster managers, such as YARN, and cloud services, such as EMR, suffer from the following issues:
Databricks Serverless pools combine elasticity and fine-grained resource sharing to tremendously simplify infrastructure management for both admins and end-users:
Next, we look at the three key properties of serverless pools in detail.
Typically, configuring a Spark cluster involves the following stages:
Serverless pools drastically simplify stage 1 and eliminate stages 2 and 3 by allowing admins to create a single pool with key AWS parameters such as spot bidding.
As mentioned earlier, predicting the correct amount of resources for a cluster is one of the hardest tasks for admins and users because they don’t know the usage requirements in advance. This results in a lot of trial and error for users. With serverless pools, users just specify the range of desired instances, and the pool elastically scales compute and local storage based on each Spark job’s resource requirements.
Autoscaling Compute: The compute resources in a serverless pool are autoscaled based on the Spark tasks queued up in the cluster. This is different from the coarse-grained autoscaling found in traditional resource managers. The Spark-native approach to scaling improves resource utilization, which brings infrastructure costs down significantly. Furthermore, serverless pools combine this autoscaling with a mix of on-demand and spot instances to further optimize costs. Read more in our autoscaling documentation.
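To make the idea of scaling on queued Spark tasks concrete, here is a minimal Python sketch. It is not the Databricks autoscaler: the function name, its parameters, and the simple "one worker per N task slots" rule are assumptions chosen for illustration. It only shows how a task-level demand signal can be turned into a worker count that stays within the minimum and maximum set by the administrator.

```python
import math


def target_workers(pending_tasks: int, running_tasks: int,
                   slots_per_worker: int,
                   min_workers: int, max_workers: int) -> int:
    """Return a worker count sized to the current task demand.

    Hypothetical sizing rule (an assumption, not the real scheduler):
    every pending or running task needs one slot, and each worker
    offers `slots_per_worker` slots.
    """
    if slots_per_worker <= 0:
        raise ValueError("slots_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")

    demand = pending_tasks + running_tasks
    needed = math.ceil(demand / slots_per_worker)
    # Clamp to the pool bounds configured by the administrator.
    return max(min_workers, min(max_workers, needed))


# An idle pool shrinks to its minimum; a large backlog is capped at the maximum.
idle = target_workers(0, 0, slots_per_worker=4, min_workers=2, max_workers=20)
busy = target_workers(30, 10, slots_per_worker=4, min_workers=2, max_workers=20)
flood = target_workers(1000, 0, slots_per_worker=4, min_workers=2, max_workers=20)
```

Because the signal is the number of tasks rather than a coarse metric such as average CPU load, the pool can grow as soon as work is queued and shrink as soon as the queue drains.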
Autoscaling Storage: Apart from compute and memory, Spark requires disk space for supporting data shuffles and spilling from memory. Having the right amount of disk space is critical to get Spark jobs working without any failures, and data engineers and scientists typically struggle to get this right. Serverless pools use logical volume management to address this issue. As the local storage of worker instances fills up, serverless pools automatically provision additional EBS volumes for the instances and the running Spark jobs seamlessly use the additional space. No more "out of disk space" failures ever!
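The storage side can be sketched in the same spirit. The following Python function is a simplified illustration, not the actual volume manager: the free-space threshold, the fixed volume size, and the cap on attached volumes are all assumed values. It decides how many additional volumes a worker would need so that a minimum fraction of its local storage stays free.

```python
def extra_volumes_needed(used_gb: float, capacity_gb: float,
                         volume_gb: float = 100.0,
                         min_free_fraction: float = 0.25,
                         max_volumes: int = 10) -> int:
    """Return how many volumes to attach to restore free-space headroom.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if capacity_gb <= 0 or volume_gb <= 0:
        raise ValueError("capacity_gb and volume_gb must be positive")

    added = 0
    while added < max_volumes:
        total = capacity_gb + added * volume_gb
        free_fraction = (total - used_gb) / total
        if free_fraction >= min_free_fraction:
            break  # enough headroom, stop adding volumes
        added += 1
    return added


# A half-full disk needs nothing; a 90% full disk gets one more volume.
healthy = extra_volumes_needed(used_gb=50, capacity_gb=100)
nearly_full = extra_volumes_needed(used_gb=90, capacity_gb=100)
```

In a real system the new volumes would then be merged into the worker's logical volume so that running jobs see a single, larger disk.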
Since serverless pools allow for fine-grained sharing of resources between multiple users, dynamic workload management and isolation are essential for predictable performance.
Preemption: When multiple users share a cluster, it is very common for a single job from one user to monopolize all the cluster resources, thereby slowing all other jobs on the cluster. Spark’s fair scheduler pool can help address such issues for a small number of users with similar workloads. As the number of users on a cluster increases, however, it becomes more and more likely that a large Spark job will hog all the cluster resources. The problem gets worse when multiple data personas run different types of workloads on the same cluster. For example, a data engineer running a large ETL job will often prevent a data analyst from running short, interactive queries. To combat such problems, the serverless pool proactively preempts Spark tasks from over-committed users to ensure all users get their fair share of cluster time. This gives each user a highly interactive experience while still minimizing overall resource costs.
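As a rough illustration of fair-share preemption, the Python sketch below decides how many running tasks to take away from users who hold more than an equal share of the task slots. The equal-share policy, the function name, and the data structures are assumptions for this example; the scheduler's real policy is not described here.

```python
from typing import Dict


def plan_preemption(running: Dict[str, int], waiting: Dict[str, int],
                    total_slots: int) -> Dict[str, int]:
    """Return {user: number_of_tasks_to_preempt} under an equal-share policy.

    Hypothetical policy: each active user is entitled to
    total_slots // active_users slots. Tasks are preempted only when a
    user below that share has waiting tasks and no free slots remain.
    """
    users = set(running) | set(waiting)
    # Sort for deterministic results when users hold the same slot count.
    active = sorted(u for u in users
                    if running.get(u, 0) + waiting.get(u, 0) > 0)
    if not active:
        return {}

    share = total_slots // len(active)
    free = total_slots - sum(running.values())

    # Slots that under-served users could use right now.
    deficit = sum(min(waiting.get(u, 0), max(0, share - running.get(u, 0)))
                  for u in active)
    needed = max(0, deficit - free)

    plan: Dict[str, int] = {}
    # Take slots from the most over-committed users first.
    for user in sorted(active, key=lambda u: running.get(u, 0), reverse=True):
        if needed == 0:
            break
        excess = running.get(user, 0) - share
        if excess <= 0:
            break
        take = min(excess, needed)
        plan[user] = take
        needed -= take
    return plan


# An ETL job holds all 16 slots while an analyst has 3 short queries waiting.
plan = plan_preemption({"etl": 16}, {"etl": 100, "analyst": 3}, total_slots=16)
```

In this example only three ETL tasks are preempted, which is just enough to start the analyst's queries while the ETL job keeps the remaining slots.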
Fault Isolation: Another common problem when multiple users share a cluster and do interactive analysis in notebooks is that one user's faulty code can crash the Spark driver, bringing down the cluster for all users. In such scenarios, the Databricks resource manager provides fault isolation by sandboxing the driver processes belonging to different notebooks from one another so that a user can safely run commands that might otherwise crash the driver without worrying about affecting the experience of other users.
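The general principle behind this kind of isolation, running each user's code in its own process so that a crash stays contained, can be shown with a small Python sketch. This is only an analogy built on the standard library's subprocess module; it is not how the Databricks resource manager sandboxes driver processes, and the helper name run_isolated is hypothetical.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(code: str, timeout: float = 30.0) -> Tuple[int, str]:
    """Run a code snippet in a separate Python process.

    Returns (exit_code, stdout). A crash in the child process does not
    affect the caller, which only sees a non-zero exit code.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


# One "notebook" exits abruptly; another keeps working normally.
crash_code, _ = run_isolated("import os; os._exit(3)")
ok_code, ok_out = run_isolated("print('still running')")
```

The parent process survives the abrupt exit of the first snippet and can still serve the second one.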
We did some benchmarking to understand how the serverless pools fare when there is a concurrent and heterogeneous load. Here is the setup: many data scientists are running Spark queries on a cluster. These are short-running interactive jobs that last at most a few minutes. What happens when we introduce a large ETL workload to the same cluster?
20 Users on Standard Cluster
For standard Spark clusters, when ETL jobs are added, average response times increase from 5 minutes (red line) to 15 minutes (orange line), and to more than 40 minutes in the worst case.
20 Users on Serverless Pool
With a serverless pool, the interactive queries get a little slower when the ETL jobs start, but the Databricks scheduler is able to guarantee performance isolation and limit their impact. The ETL jobs run in the background, efficiently utilizing idle resources. Users get excellent performance for both workloads without having to run a second cluster.
Comparison with Other Systems
We also tested the performance of larger, concurrent TPC-DS workloads on three environments: (1) Presto on EMR, (2) Apache Spark on EMR and (3) Databricks Serverless.
When 5 users each ran a TPC-DS workload concurrently on the cluster, the average query latencies for serverless pools were an order of magnitude lower than those for Presto.
With 20 users and a background ETL job on the cluster, the difference is even larger: serverless pools are 12x faster than Presto and 7x faster than Spark on EMR.
Serverless pools are the first step in our mission to eliminate all operational complexities involved with big data. They take all of the guesswork out of cluster management -- just set the minimum and maximum size of a pool and it will automatically scale within those bounds to adapt to the load being placed on it. They also provide a zero-management experience for users -- just connect to a pool and start running code from notebooks or jobs. We are excited that Databricks Serverless is the first platform to offer all of these serverless computing features for the full power of Apache Spark.
You can try Databricks Serverless in beta form today by signing up for a free Databricks trial.