Ron Hu is a Principal Big Data Architect at Huawei Technologies where he works on building a big data analytics platform based on Apache Spark. Prior to joining Huawei, he used to work at Teradata, Sybase, and MarkLogic with a focus on parallel database systems and search engine. Ron holds a PhD in Computer Science from University of California, Los Angeles.
Apache Spark 2.2 shipped with a state-of-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length, etc.) to improve the quality of query execution plans. Skewed data distributions are often inherent in many real world applications. In order to deal with skewed distributions effectively, we added equal-height histograms to Apache Spark 2.3. Leveraging reliable statistics and histogram helps Spark make better decisions in picking the most optimal query plan for real world scenarios. In this talk, we'll take a deep dive into how Spark's Cost-Based Optimizer estimates the cardinality and size of each database operator. Specifically, for skewed distribution workload such as TPC-DS, we will show histogram's impact on query plan change, hence leading to performance gain.