A Deep Dive into Spark SQL’s Catalyst Optimizer


Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees. In this talk, Yin explores a modular compiler frontend for Spark, built on this library, that includes a query analyzer, an optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
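
The central idea in the abstract is that Catalyst represents queries as trees and optimizes them by applying rules that transform those trees, and that developers can plug in rules of their own. As a rough illustration of what that extension point can look like, here is a minimal sketch against Spark 2.x-era Scala APIs: a hypothetical rule (RemoveAddZero is an invented name, not from the talk) that rewrites `expr + 0` to `expr` and is registered through the experimental extraOptimizations hook. The exact pattern-match arity of Add varies across Spark versions, so treat this as an assumption-laden sketch rather than the talk's own example.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical rule: Catalyst plans and expressions are trees, so a rule is
    // just a function from LogicalPlan to LogicalPlan built from tree transforms.
    object RemoveAddZero extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        // Rewrite `e + 0` and `0 + e` to `e` (Spark 2.x Add has two children).
        case Add(left, Literal(0, _))  => left
        case Add(Literal(0, _), right) => right
      }
    }

    object RuleDemo extends App {
      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("catalyst-rule-sketch")
        .getOrCreate()
      import spark.implicits._

      // Register the extra rule so the optimizer applies it to every query.
      spark.experimental.extraOptimizations = Seq(RemoveAddZero)

      val df = Seq(1, 2, 3).toDF("x").selectExpr("x + 0 AS y")
      df.explain(true) // the optimized plan should no longer contain the `+ 0`
      spark.stop()
    }

Running explain(true) prints the analyzed, optimized, and physical plans, which mirrors the analyze/optimize/plan pipeline the talk walks through.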

Learn more:

  • Deep Dive into Spark SQL’s Catalyst Optimizer
  • Cost Based Optimizer in Apache Spark 2.2
  • Catalyst: A Query Optimization Framework for Spark and Shark


About Yin Huai

Yin is a Staff Software Engineer at Databricks. His work focuses on designing and building the Databricks Runtime container environment and its associated testing and release infrastructure. Before joining Databricks, he was a PhD student at The Ohio State University, advised by Xiaodong Zhang. Yin is also an Apache Spark PMC member.