Extending Apache Spark SQL Data Source APIs with Join Push Down

When Spark applications operate on distributed data from disparate sources, they often have to query data sources external to Spark directly, such as backing relational databases or data warehouses. For that, Spark provides the Data Source APIs, a pluggable mechanism for accessing structured data through Spark SQL. The Data Source APIs are tightly integrated with the Spark optimizer and provide optimizations such as column pruning and filter push down to the external data source. While these optimizations significantly speed up Spark query execution, they cover only a subset of the functionality that, depending on the data source, could be pushed down and executed at the source. As part of our ongoing project to provide a generic data source push down API, this presentation shows our work on join push down. One example is the star-schema join, which can be viewed simply as a set of filters applied to the fact table. Today, the Spark optimizer recognizes star-schema joins based on heuristics and executes them using efficient left-deep trees. The alternative proposed in this work is to push the star join down to the external data source, to take advantage of multi-column indexes defined on the fact tables and of other star-join optimization techniques implemented by the relational data source.
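To make the contrast concrete, here is a minimal sketch in plain Python. It does not use Spark or its Data Source APIs. An in-memory SQLite database stands in for the external relational source, and the tables (sales, dates, stores), the index, and all function names are hypothetical and chosen only for illustration. The scan function imitates column pruning and filter push down. The two star-join functions compare joining on the engine side with sending the whole join to the source.

```python
import sqlite3


def make_source():
    """Build a toy external source: one fact table and two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sales  (date_id INTEGER, store_id INTEGER, amount REAL);
        CREATE TABLE dates  (date_id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
        -- Multi-column index on the fact table, usable only by the source.
        CREATE INDEX sales_idx ON sales (date_id, store_id);
        INSERT INTO dates  VALUES (1, 2016), (2, 2017);
        INSERT INTO stores VALUES (10, 'EU'), (20, 'US');
        INSERT INTO sales  VALUES (1, 10, 5.0), (2, 10, 7.0),
                                  (2, 20, 11.0), (2, 10, 3.0);
        """
    )
    return conn


def scan(conn, table, columns, filters=()):
    """Column pruning plus filter push down.

    Only the requested columns and the matching rows leave the source.
    `filters` is a sequence of (column, operator, value) tuples. Table and
    column names are trusted here; a real connector would validate them.
    """
    allowed_ops = {"=", "<", ">", "<=", ">="}
    clauses, params = [], []
    for column, op, value in filters:
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()


def star_join_in_engine(conn):
    """No join push down: three separate scans, joined on the engine side."""
    date_ids = {d for (d,) in scan(conn, "dates", ["date_id"], [("year", "=", 2017)])}
    store_ids = {s for (s,) in scan(conn, "stores", ["store_id"], [("region", "=", "EU")])}
    # The whole fact table is transferred before the join filters it.
    fact = scan(conn, "sales", ["date_id", "store_id", "amount"])
    return sorted(a for d, s, a in fact if d in date_ids and s in store_ids)


def star_join_pushed_down(conn):
    """Join push down: the source runs the star join and returns only the result."""
    sql = """
        SELECT f.amount
        FROM sales f
        JOIN dates  d ON f.date_id  = d.date_id
        JOIN stores s ON f.store_id = s.store_id
        WHERE d.year = ? AND s.region = ?
        ORDER BY f.amount
    """
    return [amount for (amount,) in conn.execute(sql, (2017, "EU"))]
```

Both join functions return the same rows. The difference is where the work happens: the engine-side version pulls every fact row out of the source, while the pushed-down version lets the source plan the join itself, where it may use the multi-column index on the fact table.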
Session hashtag: #EUdev7

About Ioana Delaney

Ioana Delaney is a Senior Software Engineer at the Silicon Valley Laboratory in San Jose, California. She was part of the DB2 for LUW development team until she recently joined the Spark Technology Center at IBM. She has worked in many areas of SQL and XML query compilation, including query semantics, query rewrite, query optimization, and the federated/distributed compiler.

About Jia Li

Jia Li is an advisory engineer at IBM's Spark Technology Center (STC). Jia is part of STC's Spark SQL team. Prior to this role, Jia was a core developer for IBM's Optim Query Workload Tuner.