Ke Jia is a software engineer at Intel, mainly focusing on big data area optimization. She is an active open source contributor to Apache Spark, Hive and Senty projects. Ke holds a master’s degree in computer science and software engineering from East China Normal University.
June 24, 2020 05:00 PM PT
Over the years, there has been extensive and continuous effort on improving Spark SQL's query optimizer and planner, in order to generate high quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark make better decisions in picking the most optimal query plan. Examples of these cost-based optimizations include choosing the right join type (broadcast-hash-join vs. sort-merge-join), selecting the correct build side in a hash-join, or adjusting the join order in a multi-way join. However, chances are data statistics can be out of date and cardinality estimates can be inaccurate, which may lead to a less optimal query plan. Adaptive Query Execution, new in Spark 3.0, now looks to tackle such issues by re-optimizing and adjusting query plans based on runtime statistics collected in the process of query execution. This talk is going to introduce the adaptive query execution framework along with a few optimizations it employs to address some major performance challenges the industry faces when using Spark SQL. We will illustrate how these statistics-guided optimizations work to accelerate execution through query examples. Finally, we will share the significant performance improvement we have seen on the TPC-DS benchmark with Adaptive Query Execution.