Dynamic Partition Pruning in Apache Spark - Databricks

Dynamic Partition Pruning in Apache Spark

In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.



« back
About Bogdan Ghit

Bogdan Ghit is a computer scientist and software engineer at Databricks, where he works on optimizing the SQL performance of Apache Spark. Prior to joining Databricks, Bogdan pursued his PhD at Delft University of Technology where he worked broadly on datacenter scheduling with a focus on data analytics frameworks such as Hadoop and Spark. His thesis has led to a large number of publications in top conferences such as ACM Sigmetrics and ACM HPDC.

About Juliusz Sompolski

Juliusz Sompolski joined Databricks in January 2017, as one of the founding software engineers of Databricks Amsterdam European Development Centre. He is working on optimizing the SQL performance of Databricks Runtime. Most recently, he concentrates on performance of Business Intelligence workloads.