Yin Huai is a Software Engineer at Databricks and mainly works on Spark SQL. Before joining Databricks, he was a PhD student at The Ohio State University and was advised by Xiaodong Zhang. His interests include storage systems, database systems, and query optimization. He is also an Apache Hive committer.
In this talk, I will introduce the new JSON support in Spark. With the JSON support, users do not need to define a schema for a JSON dataset. Instead, Spark SQL automatically infers the schema based on data. Then, users can write SQL queries to process this JSON dataset like processing a regular table, or seamlessly convert a JSON dataset to other formats (e.g. Parquet file). I will also talk about our ongoing efforts on letting users easily work with data from different sources with different formats.
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees. In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.#SFdev0 Additional Reading: