by Matei Zaharia, Reynold Xin, Xiao Li, Wenchen Fan and Yin Huai
We’re excited to announce that the Apache Spark™ 3.0.0 release is available on Databricks as part of our new Databricks Runtime 7.0. The 3.0.0 release includes over 3,400 patches and is the culmination of tremendous contributions from the open-source community, bringing major advances in Python and SQL capabilities and a focus on ease of use for both exploration and production. These initiatives reflect how the project has evolved to meet more use cases and broader audiences, with this year marking its 10-year anniversary as an open-source project.
Here are the biggest new features in Spark 3.0:
- Roughly 2x performance improvement over Spark 2.4, enabled by adaptive query execution, dynamic partition pruning and other optimizations
- Improved ANSI SQL compliance
- Significant enhancements to the pandas APIs (Koalas) and PySpark, including Python type hints for pandas UDFs
- Completion of key Project Hydrogen components, including accelerator-aware scheduling
- New capabilities for structured streaming and extensibility
No major code changes are required to adopt this version of Apache Spark. For more information, please check the migration guide.
Celebrating 10 years of Spark development and evolution
Spark started out of UC Berkeley’s AMPlab, a research lab focused on data-intensive computing. AMPlab researchers were working with large internet-scale companies on their data and AI problems, but saw that these same problems would also be faced by all companies with large and growing volumes of data. The team developed a new engine to tackle these emerging workloads and simultaneously make the APIs for working with big data significantly more accessible to developers.
Community contributions quickly came in to expand Spark into different areas, with new capabilities around streaming, Python and SQL, and these patterns now make up some of the dominant use cases for Spark. That continued investment has brought Spark to where it is today, as the de facto engine for data processing, data science, machine learning and data analytics workloads. Apache Spark 3.0 continues this trend by significantly improving support for SQL and Python, the two most widely used languages with Spark today, and by optimizing performance and operability across the rest of Spark.
Improving the Spark SQL engine
Spark SQL is the engine that backs most Spark applications. For example, on Databricks, we found that over 90% of Spark API calls use DataFrame, Dataset and SQL APIs along with other libraries optimized by the SQL optimizer. This means that even Python and Scala developers pass much of their work through the Spark SQL engine. In the Spark 3.0 release, 46% of all the patches contributed were for SQL, improving both performance and ANSI compatibility. In benchmarks, Spark 3.0 performed roughly 2x better than Spark 2.4 in total runtime. Next, we explain four new features in the Spark SQL engine.
The new Adaptive Query Execution (AQE) framework improves performance and simplifies tuning by generating a better execution plan at runtime, even if the initial plan is suboptimal due to absent or inaccurate data statistics and misestimated costs. Because storage and compute are separated in Spark, data arrival can be unpredictable. For these reasons, runtime adaptivity is more critical for Spark than for traditional systems. This release introduces three major adaptive optimizations (a minimal configuration sketch follows the list):
- Dynamically coalescing shuffle partitions
- Dynamically switching join strategies
- Dynamically optimizing skew joins
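As a quick illustration, the sketch below shows how AQE and its related optimizations can be switched on through SQL configuration; the join itself and the table names (`orders`, `customers`) are hypothetical and only there to give AQE something to re-plan.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# Enable Adaptive Query Execution (off by default in Spark 3.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Let AQE coalesce small shuffle partitions after each shuffle stage.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Let AQE split heavily skewed partitions in sort-merge joins.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# With AQE on, the plan for a join like this can be revised at runtime,
# e.g. converted to a broadcast hash join if one side turns out to be small.
result = spark.table("orders").join(spark.table("customers"), "customer_id")
result.explain()
```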
In a 3TB TPC-DS benchmark, compared with running without AQE, Spark with AQE yields more than 1.5x performance speedups for two queries and more than 1.1x speedups for another 37 queries.
Dynamic Partition Pruning is applied when the optimizer is unable to identify at compile time the partitions it can skip. This is not uncommon in star schemas, which consist of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table, by identifying those partitions that result from filtering the dimension tables. In a TPC-DS benchmark, 60 out of 102 queries show a significant speedup between 2x and 18x.
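A minimal sketch of the pattern is shown below, assuming a hypothetical star schema with a fact table `sales` partitioned by `date_id` and a dimension table `dates`; the filter on the dimension table is propagated at runtime so that only the matching fact-table partitions are scanned.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-demo").getOrCreate()

# Dynamic partition pruning is enabled by default in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# `sales` and `dates` are hypothetical tables used only for illustration.
holiday_sales = spark.sql("""
    SELECT s.store_id, SUM(s.amount) AS total
    FROM sales s
    JOIN dates d ON s.date_id = d.date_id
    WHERE d.is_holiday = true
    GROUP BY s.store_id
""")

# The physical plan should show a dynamic pruning subquery on the
# partitioned scan of the fact table.
holiday_sales.explain()
```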
ANSI SQL compliance is critical for workload migration from other SQL engines to Spark SQL. To improve compliance, this release switches to the Proleptic Gregorian calendar and also enables users to forbid the use of ANSI SQL reserved keywords as identifiers. Additionally, we’ve introduced runtime overflow checking in numeric operations and compile-time type enforcement when inserting data into a table with a predefined schema. These new validations improve data quality.
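The sketch below illustrates the overflow-checking behavior under the ANSI mode flag; the exact error type surfaced to Python may differ by environment, so treat it as an illustration rather than a reference.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ansi-demo").getOrCreate()

# With ANSI mode off (the default), integer overflow silently wraps around.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(2147483647 AS INT) + 1 AS v").show()

# With ANSI mode on, the same expression fails at runtime with an
# arithmetic overflow error instead of returning a wrapped value.
spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT CAST(2147483647 AS INT) + 1 AS v").show()
except Exception as e:
    print("Overflow detected:", type(e).__name__)
```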
Join hints: While we continue to improve the compiler, there’s no guarantee that it will make the optimal decision in every situation, since join algorithm selection is based on statistics and heuristics. When the compiler is unable to make the best choice, users can apply join hints to influence the optimizer to choose a better plan. This release extends the existing join hints by adding new hints: SHUFFLE_MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL.
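The new hints can be supplied either through the DataFrame `hint()` method or through SQL hint comments; a small sketch follows, again using hypothetical `orders` and `customers` tables.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-hints-demo").getOrCreate()

# `orders` and `customers` are hypothetical tables used only for illustration.
orders = spark.table("orders")
customers = spark.table("customers")

# DataFrame API: suggest a shuffle hash join, building the hash side from `customers`.
hinted = orders.join(customers.hint("SHUFFLE_HASH"), "customer_id")
hinted.explain()

# Equivalent SQL syntax using a hint comment.
spark.sql("""
    SELECT /*+ SHUFFLE_MERGE(c) */ o.order_id, c.name
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
""").explain()
```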
Enhancing the Python APIs: PySpark and Koalas
Many Python developers use the pandas API for data structures and data analysis, but it is limited to single-node processing. We have also continued to develop Koalas, an implementation of the pandas API on top of Apache Spark, to make data scientists more productive when working with big data in distributed environments. Koalas eliminates the need to rebuild many pandas functions (e.g., plotting support) in PySpark while achieving efficient performance across a cluster.
After more than a year of development, the Koalas API coverage for pandas is close to 80%. The monthly PyPI downloads of Koalas have rapidly grown to 850,000, and Koalas is rapidly evolving with a biweekly release cadence. While Koalas may be the easiest way to migrate from your single-node pandas code, many still use the PySpark APIs, which are also growing in popularity.
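To give a feel for the programming model, here is a minimal sketch assuming the `koalas` package is installed on a Spark 3.x cluster; the CSV path and column names are hypothetical.

```python
# Koalas exposes the pandas API on top of Spark DataFrames.
import databricks.koalas as ks

kdf = ks.read_csv("/data/sales.csv")            # distributed read, pandas-like API
kdf["amount_usd"] = kdf["amount"] * 1.1          # column arithmetic as in pandas
summary = kdf.groupby("region")["amount_usd"].sum().sort_values(ascending=False)
print(summary.head(10))

# Convert to and from a PySpark DataFrame when needed.
sdf = kdf.to_spark()
kdf2 = sdf.to_koalas()
```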
Spark 3.0 brings several enhancements to the PySpark APIs (a short example follows the list):
- A redesigned pandas UDF API that uses Python type hints, making UDF definitions clearer and less error-prone
- New pandas UDF types and pandas function APIs, such as map and co-grouped map
- Better error handling, simplifying PySpark exceptions
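As a sketch of the new style, the pandas UDF below infers its type from Python type hints rather than an explicit PandasUDFType argument; it assumes pandas and PyArrow are available, and the temperature conversion is purely illustrative.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# Spark 3.0 style: the pandas UDF type is inferred from the type hints.
@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius(col("temp_f")).alias("temp_c")).show()
```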
Improving Python support and usability in Spark continues to be one of our highest priorities.
Hydrogen, streaming and extensibility
With Spark 3.0, we’ve finished key components for Project Hydrogen as well as introduced new capabilities to improve streaming and extensibility.
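One of the key Project Hydrogen components delivered in 3.0 is accelerator-aware scheduling, which lets jobs request GPUs and discover the devices assigned to each task. The sketch below assumes a cluster manager that exposes GPUs and a configured discovery script; the property values, script path and training stub are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark import TaskContext

# Request one GPU per executor and one GPU per task. The discovery script
# path is an assumption for this sketch and must match your cluster setup.
spark = (
    SparkSession.builder
    .appName("gpu-demo")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
    .getOrCreate()
)

def train_partition(rows):
    # Inside a task, the assigned GPU addresses are available on the TaskContext.
    gpus = TaskContext.get().resources()["gpu"].addresses
    # ... hand `gpus` to a deep learning framework here ...
    yield (len(list(rows)), gpus)

result = spark.sparkContext.parallelize(range(100), 2).mapPartitions(train_partition).collect()
print(result)
```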
Other updates in Spark 3.0
Spark 3.0 is a major release for the community, with over 3,400 Jira tickets resolved. It’s the result of contributions from over 440 contributors, including individuals as well as companies like Databricks, Google, Microsoft, Intel, IBM, Alibaba, Facebook, Nvidia, Netflix, Adobe and many more. We’ve highlighted a number of the key SQL, Python and streaming advancements in Spark for this blog post, but there are many other capabilities in this 3.0 milestone not covered here. Learn more in the release notes and discover all the other improvements to Spark, including data sources, ecosystem, monitoring and more.
Get started with Spark 3.0 today
If you want to try out Apache Spark 3.0 in the Databricks Runtime 7.0, sign up for a free trial account and get started in minutes. Using Spark 3.0 is as simple as selecting version “7.0” when launching a cluster.
Learn more about feature and release details: