Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees. In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.#SFdev0
Yin Huai is a Software Engineer at Databricks and mainly works on Spark SQL. Before joining Databricks, he was a PhD student at The Ohio State University and was advised by Xiaodong Zhang. His interests include storage systems, database systems, and query optimization. He is also an Apache Hive committer.