Databricks, the Unified Analytics Platform, has always been a cloud-first platform. We believe in the scalability and elasticity of the cloud so that customers can easily run their large production workloads and pay for exactly what they use. Hence, we have always charged our customers at per-second granularity.
Until last month, billing on AWS was based on hourly increments. Recently, AWS moved to per-second billing. This move, coupled with Databricks' per-second billing, enables a huge shift in the architecture for big data processing. It eliminates the unnecessary complexity brought by resource schedulers like YARN in the cloud and provides a much simpler and more powerful way to run production big data workloads.
Under hourly billing increments, users spent a lot of time playing a giant game of Tetris with their big data workloads, figuring out how to pack jobs so that every minute of each compute hour was used. For example, a team might delay a 20-minute job so it could run within the same paid hour as a 40-minute job, rather than paying for two partially used hours.
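The difference between the two billing models is easy to quantify. The sketch below compares what a single job costs under hourly increments versus per-second billing; the $1.00/hour rate and 65-minute runtime are illustrative numbers, not actual AWS prices.

```python
import math

def hourly_cost(runtime_minutes: float, rate_per_hour: float) -> float:
    """Hourly-increment billing: every started hour is charged in full."""
    return math.ceil(runtime_minutes / 60) * rate_per_hour

def per_second_cost(runtime_minutes: float, rate_per_hour: float) -> float:
    """Per-second billing: pay only for the seconds actually used."""
    return (runtime_minutes * 60) * rate_per_hour / 3600

# A 65-minute job on a $1.00/hour instance: hourly increments charge two
# full hours, while per-second billing charges only the ~65 minutes used.
print(hourly_cost(65, 1.00))      # 2.0
print(per_second_cost(65, 1.00))  # ~1.08
```

A job that barely spills past the hour boundary nearly doubles in cost under hourly billing, which is exactly what drove the bin-packing behavior described above.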
The above problem was compounded when there were many such jobs to run. To handle this challenge, many organizations turned to a resource scheduler like YARN, following the traditional on-premises model of setting up one or more big multi-tenant clusters in the cloud and letting YARN bin-pack the different jobs onto them.
With per-second billing, a resource scheduler like YARN becomes an unnecessary added layer on top of cloud compute services like EC2. In fact, we believe it is an anti-pattern in the cloud: the cloud's elasticity removes the need for this complexity.
Hence, at Databricks, we recommend taking full advantage of the cloud's inherent elasticity: let each job spin up its own cluster, run on that new cluster, and terminate the cluster automatically once the job finishes. In other words, each job can use a resource profile that matches its needs. Databricks takes care of provisioning the resources with the given profile when the job requires them and automatically de-provisions them once the job is complete.
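In practice, this per-job cluster pattern can be expressed through the Databricks Jobs API by attaching a `new_cluster` specification to the job instead of pointing it at an existing cluster. The sketch below builds such a request body; the job name, notebook path, worker count, and the placeholder runtime and instance-type strings are illustrative assumptions, not values from this article.

```python
import json

def ephemeral_job_spec(job_name: str, notebook_path: str) -> dict:
    """Build a Jobs API request body that runs a job on its own new cluster.

    The cluster is provisioned for the run and goes away when the job
    finishes, so no long-running multi-tenant cluster is needed.
    """
    return {
        "name": job_name,
        "new_cluster": {                      # a fresh cluster per run
            "spark_version": "<runtime-version>",  # placeholder: pick a Databricks runtime
            "node_type_id": "<instance-type>",     # placeholder: EC2 type sized to this job
            "num_workers": 8,                      # illustrative worker count
        },
        "notebook_task": {"notebook_path": notebook_path},
    }

# Hypothetical job: a nightly ETL notebook on its own right-sized cluster.
payload = ephemeral_job_spec("nightly-etl", "/Jobs/nightly_etl")
print(json.dumps(payload, indent=2))
```

The resulting payload would be POSTed to the Jobs API's job-creation endpoint; the key point is the `new_cluster` block, which ties the cluster's lifetime to the job's.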
This simple approach to running production jobs in the cloud has tremendous benefits: each job runs in isolation on hardware sized to its needs, and with per-second billing you pay only for the time the job actually runs.
There is one common misconception users initially have with this approach: isn't running every job on its own cluster inefficient, since the jobs no longer share resources? The answer is a simple no. The approach is very similar to how an application runs on YARN; we have simply removed the other unnecessary scheduling complexities. With per-second billing, it is therefore just as cost-efficient as running on YARN.
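The cost-equivalence argument can be made concrete with a little arithmetic. Under per-second billing, what you pay for is node-time, and a job consumes the same node-time whether it runs on its own cluster or is bin-packed onto a shared one. The job sizes below are made-up numbers for illustration.

```python
# Three hypothetical jobs, each needing (workers, runtime in minutes).
jobs = [(4, 30), (2, 45), (8, 10)]

# Total node-minutes when each job gets its own right-sized cluster.
per_job_clusters = sum(workers * minutes for workers, minutes in jobs)

# Even a perfect YARN-style bin-packing onto one shared cluster pays for
# the same node-minutes, since every job still occupies its workers for
# its full runtime; the scheduler only changes *where* the work runs.
print(per_job_clusters)  # 290 node-minutes either way
```

Under hourly billing the shared cluster could amortize partially used hours across jobs, which is why bin-packing used to pay off; per-second billing removes that advantage.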
This simple approach does require some key functionality to bridge the gaps left by not using traditional multi-tenant clusters.
With the recent announcement of AWS per-second billing, organizations can finally break away from the complex on-premises model of running multiple jobs on a traditional multi-tenant cluster and move toward a much simpler and more powerful model: running each production job in its own ephemeral cluster in the cloud. This model allows data engineers and data scientists to be far more productive, spending their time working with data instead of configuring infrastructure to optimize costs.
If you are interested in trying out this new model in action, you can sign up for a free trial of Databricks.
If you have any questions, you can contact us for more details.