Fang Cao is a Senior Staff Software Engineer at Huawei Technologies where he works on building a big data analytics platform based on Apache Spark. Prior to joining Huawei, he was a Senior Software Engineer at Cisco and Lucent. Fang holds a Master’s degree in Computer Science from the University of Wisconsin at Madison.
In Spark SQL's Catalyst optimizer, many rule based optimization techniques have been implemented, but the optimizer itself can still be improved. For example, without detailed column statistics information on data distribution, it is difficult to accurately estimate the filter factor and cardinality, and thus output size of a database operator. With the inaccurate and/or misleading statistics, it often leads the optimizer to choose suboptimal query execution plans. Furthermore, it may cause Out-Of-Memory exception during execution because memory requirements have been erroneously estimated. In our project, we use Hive's Analyze Table statement to collect the detailed column statistics and save them into meta-store. For the relevant columns, we collect number of distinct values, number of NULL values, maximum/minimum value, average/maximal column length, etc. With the number of distinct values and number of records of a table, we can determine how unique a column is although Spark SQL does not support primary key. This helps determine, for example, the output size of a multi-column group-by operation. To deal with data skew effectively, we have also built one-dimensional and two-dimensional height-balanced histograms for column data distribution. With Catalyst's extensible design, we implemented a new mechanism to better estimate the output size of each database operator. With reliable statistics, we were able to make good decisions in these areas: selecting the correct build side of a hash-join operation, choosing the right join type (broadcast hash-join versus shuffled hash-join), setting optimal number of partitions in shuffle operation, adjusting multi-way join order, etc. We have run TPC-H benchmark and some real-world test cases. Our results show good performance gain in 2X-5X speedup depending on data volume. In this talk, we will review these results, the techniques used to accomplish them and we will share our experience and lessons learned in enhancing Spark SQL optimizer.