Spark SQL’s Catalyst optimizer implements many rule-based optimization techniques, but the optimizer itself can still be improved. For example, without detailed column statistics describing the data distribution, it is difficult to accurately estimate the filter factor, the cardinality, and thus the output size of a database operator. Inaccurate or misleading statistics often lead the optimizer to choose suboptimal query execution plans. We added a Cost-Based Optimizer framework to the Spark SQL engine. In our framework, we use the ANALYZE TABLE SQL statement to collect detailed column statistics and save them into Spark’s catalog. For the relevant columns, we collect the number of distinct values, the number of NULL values, the maximum and minimum values, the average and maximum column length, etc. We also save the data distribution of columns in either equal-width or equal-height histograms in order to deal with data skew effectively. Furthermore, with the number of distinct values and the number of records of a table, we can determine how unique a column is, even though Spark SQL does not support primary keys. This helps determine, for example, the output size of a join operation or a multi-column group-by operation.
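To make the role of these statistics concrete, the following minimal Python sketch shows how a filter factor might be derived from them. It is not Spark's implementation: the `ColumnStats` class and all function names are hypothetical, and the formulas are the textbook estimates that assume values are uniformly distributed, which is exactly the assumption that histograms relax for skewed data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical container for the per-column statistics named in the text."""
    row_count: int       # number of records in the table
    distinct_count: int  # number of distinct values
    null_count: int      # number of NULL values
    min_value: float     # minimum value
    max_value: float     # maximum value


def non_null_fraction(stats: ColumnStats) -> float:
    """Fraction of rows whose value is not NULL."""
    if stats.row_count == 0:
        return 0.0
    return (stats.row_count - stats.null_count) / stats.row_count


def uniqueness(stats: ColumnStats) -> float:
    """Distinct values divided by records; 1.0 means the column is key-like."""
    if stats.row_count == 0:
        return 0.0
    return stats.distinct_count / stats.row_count


def equality_selectivity(stats: ColumnStats, value: float) -> float:
    """Filter factor for `col = value`, assuming every distinct value is equally frequent."""
    out_of_range = not (stats.min_value <= value <= stats.max_value)
    if stats.distinct_count == 0 or out_of_range:
        return 0.0
    return non_null_fraction(stats) / stats.distinct_count


def less_than_selectivity(stats: ColumnStats, value: float) -> float:
    """Filter factor for `col < value`, assuming values are uniform over [min, max]."""
    if value <= stats.min_value:
        return 0.0
    if value > stats.max_value:
        return non_null_fraction(stats)
    # Here min < value <= max, so the span is strictly positive.
    span = stats.max_value - stats.min_value
    return non_null_fraction(stats) * (value - stats.min_value) / span


def estimated_rows(stats: ColumnStats, selectivity: float) -> int:
    """Output cardinality of a filter: input rows scaled by the filter factor."""
    return round(stats.row_count * selectivity)
```

For instance, for a column with 1,000 rows, 100 distinct values, no NULLs, and a range of 0 to 200, this sketch gives `col = 50` a filter factor of 1/100 and `col < 50` a filter factor of 1/4. On skewed data both numbers can be far off, which is the motivation for storing histograms.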
In our framework, we compute the cardinality and output size of each database operator. With reliable statistics and derived cardinalities, we are able to make good decisions in these areas: selecting the correct build side of a hash-join operation, choosing the right join type (broadcast hash-join versus shuffled hash-join), adjusting the multi-way join order, etc. In this talk, we will show Spark SQL’s new Cost-Based Optimizer framework and its performance impact on the TPC-DS benchmark queries.
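The join decisions listed above can be illustrated with a small Python sketch. It is an illustration under stated assumptions rather than the framework's actual code: `RelationEstimate` and the function names are hypothetical, the join cardinality uses the textbook formula |R| × |S| / max(ndv(R), ndv(S)), and the default broadcast threshold of 10 MB is chosen to mirror the default of Spark's `spark.sql.autoBroadcastJoinThreshold` setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationEstimate:
    """Hypothetical summary of one join input as an optimizer might estimate it."""
    name: str
    row_count: int          # estimated cardinality of the input
    avg_row_bytes: int      # estimated average row width in bytes
    join_key_distinct: int  # number of distinct values of the join key

    @property
    def size_bytes(self) -> int:
        """Estimated output size: cardinality times average row width."""
        return self.row_count * self.avg_row_bytes


def inner_join_cardinality(left: RelationEstimate, right: RelationEstimate) -> int:
    """Textbook equi-join estimate: |L| * |R| / max(ndv(L.key), ndv(R.key))."""
    max_ndv = max(left.join_key_distinct, right.join_key_distinct)
    if max_ndv == 0:
        return 0
    return left.row_count * right.row_count // max_ndv


def choose_build_side(left: RelationEstimate, right: RelationEstimate) -> RelationEstimate:
    """Build the hash table on the input with the smaller estimated size."""
    return left if left.size_bytes <= right.size_bytes else right


def choose_join_type(
    left: RelationEstimate,
    right: RelationEstimate,
    broadcast_threshold_bytes: int = 10 * 1024 * 1024,  # assumed 10 MB default
) -> str:
    """Broadcast the build side if it is small enough; otherwise shuffle both inputs."""
    build = choose_build_side(left, right)
    if build.size_bytes <= broadcast_threshold_bytes:
        return "broadcast hash-join"
    return "shuffled hash-join"
```

With a 1,000,000-row fact table joined to a 1,000-row dimension table on a key that has 1,000 distinct values on both sides, the sketch estimates 1,000,000 output rows, picks the dimension table as the build side, and selects a broadcast hash-join. If the size estimates were badly wrong, the same logic could broadcast a table that does not fit in memory, which is why reliable statistics matter here.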