Apache Spark 2.2 shipped with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL count, max/min, average/max length) to improve the quality of query execution plans. Skewed data distributions are inherent in many real-world applications. To handle skew effectively, we added equal-height histograms in Apache Spark 2.3. Leveraging reliable statistics and histograms helps Spark make better decisions and pick the optimal query plan in real-world scenarios.
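As a sketch of what collecting these statistics looks like in practice, the SQL below enables the cost-based optimizer and histogram collection and then gathers table- and column-level statistics. The configuration keys are Spark SQL settings; the table and column names are hypothetical (borrowed from the TPC-DS schema for illustration):

```sql
-- Enable the cost-based optimizer and equal-height histogram collection
SET spark.sql.cbo.enabled = true;
SET spark.sql.statistics.histogram.enabled = true;

-- Collect table-level statistics, then per-column statistics
-- (histograms are built for the listed columns when the flag above is on)
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_quantity, ss_list_price;
```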
In this talk, we’ll take a deep dive into how Spark’s cost-based optimizer estimates the cardinality and output size of each database operator. Specifically, for workloads with skewed distributions, such as TPC-DS, we will show how histograms change the query plan and lead to performance gains.
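To give a rough intuition for why equal-height histograms handle skew well, here is a minimal Python sketch of building such a histogram and using it to estimate the selectivity of a range predicate. This is not Spark's actual implementation; the function names, bin-boundary scheme, and estimation formula are simplified assumptions:

```python
import bisect

def build_equal_height_histogram(values, num_bins):
    """Split the sorted values into bins holding roughly equal row counts;
    return the list of bin upper-bound boundaries (illustrative scheme)."""
    data = sorted(values)
    n = len(data)
    # boundaries[i] is the value at the upper end of bin i
    return [data[min(n - 1, ((i + 1) * n) // num_bins - 1)]
            for i in range(num_bins)]

def estimate_le_selectivity(boundaries, x):
    """Estimate selectivity of `col <= x`: each bin holds 1/num_bins of the
    rows, so count the bins whose upper bound falls at or below x."""
    full_bins = bisect.bisect_right(boundaries, x)
    return full_bins / len(boundaries)

# Skewed column: 90 rows with value 1, plus 10 rows spread over 1..10
values = [1] * 90 + list(range(1, 11))
bounds = build_equal_height_histogram(values, 10)
# Most bins end at 1, so the histogram sees `col <= 1` as ~90% selective,
# whereas a uniform assumption over the range [1, 10] would guess ~10%
print(estimate_le_selectivity(bounds, 1))
```

Because bin boundaries adapt to the data, heavy values occupy many bins and the estimate tracks the true frequency; a uniform (non-histogram) assumption would badly underestimate it on this skewed column.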
Session hashtag: #DevSAIS13
Ron Hu is a Senior Staff Software Engineer at LinkedIn, where he works on building a big data analytics platform based on Apache Spark. Prior to joining LinkedIn, he worked at Teradata, MarkLogic, and Huawei with a focus on parallel database systems and search engines. Ron holds a PhD in Computer Science from the University of California, Los Angeles.
Zhenhua Wang is a Research Engineer at Huawei Technologies, where he works on building a big data analytics platform based on Apache Spark. Prior to joining Huawei, he received a PhD in Computer Science from Zhejiang University. His research interests include information retrieval and web data mining.