Performing Advanced Analytics on Relational Data with Spark SQL - Databricks

Performing Advanced Analytics on Relational Data with Spark SQL

Download Slides

In this talk I’ll describe Spark SQL, a new Alpha component that is part of the Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query external data with complex analytics. In addition to Spark SQL, I’ll also talk about the Catalyst optimizer framework, which allows Spark SQL to automatically rewrite query plans to execute more efficiently.

About Michael Armbrust

Michael Armbrust is committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.