A Deep Dive into Spark SQL’s Catalyst Optimizer – Databricks

Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees. In this talk, Yin explores a modular compiler frontend for Spark that is built on this library and includes a query analyzer, an optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
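To make the "library for manipulating trees" idea concrete, here is a minimal sketch of rule-based tree rewriting in plain Python. It is not Catalyst's API (Catalyst itself is written in Scala), and every name in it (`Expr`, `Literal`, `Attribute`, `Add`, `transform_up`, `fold_constants`, `simplify_add_zero`, `optimize`) is hypothetical and chosen only for illustration. It assumes an optimizer can be modeled as a list of small rewrite rules applied to an immutable expression tree until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class Expr:
    """Base class for nodes of a toy expression tree (hypothetical, not Catalyst)."""


@dataclass(frozen=True)
class Literal(Expr):
    value: int


@dataclass(frozen=True)
class Attribute(Expr):
    name: str


@dataclass(frozen=True)
class Add(Expr):
    left: Expr
    right: Expr


# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def transform_up(node: Expr, rule: Rule) -> Expr:
    """Rewrite the children first, then offer the rebuilt node to the rule."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    rewritten = rule(node)
    return node if rewritten is None else rewritten


def fold_constants(node: Expr) -> Optional[Expr]:
    """Add(Literal(a), Literal(b)) becomes Literal(a + b)."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return None


def simplify_add_zero(node: Expr) -> Optional[Expr]:
    """Add(x, Literal(0)) or Add(Literal(0), x) becomes x."""
    if isinstance(node, Add):
        if node.right == Literal(0):
            return node.left
        if node.left == Literal(0):
            return node.right
    return None


def optimize(tree: Expr, rules: List[Rule], max_iterations: int = 100) -> Expr:
    """Apply every rule in turn, repeating until the tree stops changing."""
    for _ in range(max_iterations):
        before = tree
        for rule in rules:
            tree = transform_up(tree, rule)
        if tree == before:
            break
    return tree


# Represents the expression (x + (1 + 2)) + 0.
plan = Add(Add(Attribute("x"), Add(Literal(1), Literal(2))), Literal(0))
optimized = optimize(plan, [fold_constants, simplify_add_zero])
print(optimized)
```

In this sketch the first rule folds `1 + 2` into `3` and the second drops the `+ 0`, leaving `x + 3`. Adding a new optimization means appending another function to the rule list, which is roughly the sense in which a rule-based design is extensible; how Catalyst actually registers and batches its rules is covered in the talk and is not modeled here.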

Learn more:

  • Deep Dive into Spark SQL’s Catalyst Optimizer
  • Cost Based Optimizer in Apache Spark 2.2
  • Catalyst: A Query Optimization Framework for Spark and Shark

About Yin Huai

Yin Huai is a Software Engineer at Databricks and mainly works on Spark SQL. Before joining Databricks, he was a PhD student at The Ohio State University and was advised by Xiaodong Zhang. His interests include storage systems, database systems, and query optimization. He is also an Apache Hive committer.