Michael Armbrust is committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Security monitoring and threat response has diverse processing demands on large volumes of log and telemetry data. Processing requirements span from low-latency stream processing to interactive queries over months of data. To make things more challenging, we must keep the data accessible for a retention window measured in years. Having tackled this problem before in a massive-scale environment using Apache Spark, when it came time to do it again, there were a few things I knew worked and a few wrongs I wanted to right. We approached Databricks with a set of challenges to collaborate on: provide a stable and optimized platform for Unified Analytics that allows our team to focus on value delivery using streaming, SQL, graph, and ML; leverage decoupled storage and compute while delivering high performance over a broad set of workloads; use S3 notifications instead of list operations; remove Hive Metastore from the write path; and approach indexed response times for our more common search cases, without hard-to-scale index maintenance, over our entire retention window. This is about the fruit of that collaboration.
Query optimization can greatly improve both the productivity of developers and the performance of the queries that they write. A good query optimizer is capable of automatically rewriting relational queries to execute more efficiently, using techniques such as filtering data early, utilizing available indexes, and even ensuring different data sources are joined in the most efficient order. By performing these transformations, the optimizer not only improves the execution times of relational queries, but also frees the developer to focus on the semantics of their application instead of its performance. Unfortunately, building an optimizer is a incredibly complex engineering task and thus many open source systems perform only very simple optimizations. Past research has attempted to combat this complexity by providing frameworks that allow the creators of optimizers to write possible optimizations as a set of declarative rules. However, the use of such frameworks has required the creation and maintenance of special “optimizer compilers” and forced the burden of learning a complex domain specific language upon those wishing to add features to the optimizer. Instead, we propose Catalyst, a query optimization framework embedded in Scala. Catalyst takes advantage of Scala’s powerful language features such as pattern matching and runtime metaprogramming to allow developers to concisely specify complex relational optimizations. In this talk I will describe the framework and how it allows developers to express complex query transformations in very few lines of code. I will also describe our initial efforts at improving the execution time of Shark queries by greatly improving its query optimization capabilities. ________________  Graefe, G. The Cascades Framework for Query Optimization. In Data Engineering Bulletin. Sept. 1995.  Goetz Graefe , David J. DeWitt, The EXODUS optimizer generator, Proceedings of the 1987 ACM SIGMOD international conference on Management of data, p.160-172, May 27-29, 1987, San Francisco, California, United States Additional Reading:
In this talk I’ll describe Spark SQL, a new Alpha component that is part of the Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query external data with complex analytics. In addition to Spark SQL, I’ll also talk about the Catalyst optimizer framework, which allows Spark SQL to automatically rewrite query plans to execute more efficiently.
Since its introduction one year ago, Spark SQL has proven to be a highly effective way to speed up existing SQL workloads by leveraging the power of Spark. Spark SQL’s built-in support for reading data from existing Hive warehouses allows HQL users to achieve better performance simply by switching query engines. However even non-SQL workloads can often benefit from the automatic optimizations that Spark SQL can perform. At the core of Spark SQL is the notion of a SchemaRDD, which improves on traditional RDDs by giving them knowledge of how best to manipulate the data that they hold. In addition to rich querying, this structure makes it possible to more efficiently cache and shuffle the data during computations. Furthermore, with the addition of the data sources API, Spark SQL makes it easier to compute over structured data sourced from a wide variety of formats, including Parquet, JSON, Apache Avro, and more. This talk will show examples of how even traditional Spark jobs can benefit from using SchemaRDDs to capture richer information about the structure of the data being processed. It will also describe how the built-in JDBC server opens the loosely structured world of big data to traditional BI tools. Finally, it will reveal the road map for Spark SQL and where the project is heading.
This post will provide a technical overview of Spark's DataFrame API. First, we'll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or structured file formats like Avro. We'll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL's Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark's core engine to substantially improve performance and memory use for common types of operations.
As Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk I given an overview of some of the exciting new API's available in Spark 2.0, namely Datasets and Streaming DataFrames/Datasets. Datasets provide an evolution of the RDD API by allowing users to express computation as type-safe lambda functions on domain objects, while still leveraging the powerful optimizations supplied by the Catalyst optimizer and Tungsten execution engine. I will describe the high-level concepts as well as dive into the details of the internal code generation that enable us to provide good performance automatically. Streaming DataFrames/Datasets let developers seamlessly turn their existing structured pipelines into real-time incremental processing engines. I will demonstrate this new API's capabilities and discuss future directions including easy sessionization and event-time-based windowing.
As Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk I given an overview of some of the exciting new API's available in Spark 2.0, namely Datasets and Streaming DataFrames/Datasets. Datasets provide an evolution of the RDD API by allowing users to express computation as type-safe lambda functions on domain objects, while still leveraging the powerful optimizations supplied by the Catalyst optimizer and Tungsten execution engine. I will describe the high-level concepts as well as dive into the details of the internal code generation that enable us to provide good performance automatically. Streaming DataFrames/Datasets let developers seamlessly turn their existing structured pipelines into real-time incremental processing engines. I will demonstrate this new API's capabilities and discuss future directions including easy sessionization and event-time-based windowing.Related Articles:
In Spark 2.0, we introduced Structured Streaming, which allows users to continually and incrementally update your view of the world as new data arrives, while still using the same familiar Spark SQL abstractions. I talk about progress we've made since then on robustness, latency, expressiveness and observability, using examples of production end-to-end continuous applications.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing application. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Dataset and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees. Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis. In this session, Das will walk through a concrete example where - in less than 10 lines - you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He'll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.Learn more: