Min Qiu is currently a Software Engineer at a stealth mode startup. He was a Staff Software Engineer from Oct 2014 to May 2016 at Huawei Technologies where he worked on building a big data analytics platform based on Apache Spark. He has extensive hands-on experience on Spark SQL. Prior to joining Huawei, he was a Senior Member of the Technical Staff at Oracle, where he worked on the TimesTen In-Memory database team. Min holds a Master’s degree in Computer Science from the University of Wisconsin at Madison.
In Spark SQL Catalyst optimizer many rule based optimization techniques have been implemented, but the optimizer itself is still premature. During the process of building the big data analytics platform upon Spark SQL in Huawei, we discovered that the current optimizer has room for improvement. We observed unnecessary Cartesian Product operator as well as retrieval, materialization and exchange of irrelevant columns in several cases on THC-H benchmark. Due to these inefficiencies many important query plans behaved sub-optimally. For example, we often ran into OutOfMemeory exception during SQL execution due to the large amount of memory consumption, and the performance remained unexpectedly bad in the successful runs. In our practice, we implemented several new optimization rules and some key enhancements to existing rules. One set of rules is called Cartesian Product Elimination. The Predicate Rewrite rule is to refractor the filter expression so that the join condition hidden in disjunctive norm form is extracted and exposed. The Join Order Adjustment rule is to adjust the join order corresponding to the matched join conditions. We also enhanced the existing Column Pruning rule to cover more cases. These optimization improvements help reduce unnecessary computations and data exchanges in some specific queries. The performance results of the relevant queries from TPC-H benchmark show significant improvement (4-6X speedup depending on the data volume). Our implementation easily plugs into Spark SQL Catalyst optimizer without breaking the existing rules. We expect to have contributed our work to the community SPARK prior to the Summit. In our presentation, after a quick review of the Catalyst optimizer architecture, we will discuss our rule-based optimization enhancements in detail and show the resulting performance improvements. In the future we plan to open source our cost based optimization (CBO) that improves performance in all cases on TPC-H benchmark.