Sameer Agarwal is a Software Engineer at Databricks working on Spark core and Spark SQL. Previously, he received his PhD in Databases from UC Berkeley AMPLab where he worked on BlinkDB, an approximate query engine for Spark.
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2. Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features: Kubernetes Scheduler Backend PySpark Performance and Enhancements Continuous Structured Streaming Processing DataSource v2 APIs Spark History Server Performance Enhancements
There is an exponential growth in data that is being collected and stored. This has created an unprecedented demand for processing and analyzing massive amounts of data. Furthermore, analysts and data scientists want results fast to enable explorative data analysis, while more and more applications require data processing to happen in near real time. In this talk, we present BlinkDB, which uses a radically different approach where queries are always processed in near real time, regardless of the size of the underlying dataset. This is enabled by not looking at all the data, but rather operating on statistical samples of the underlying datasets. More precisely, BlinkDB gives the user the ability to trade between the accuracy of the results and the time it takes to compute queries. The challenge is to ensure that query results are still meaningful, even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that tell the quality of the sampled results. To compute the sampled data in parallel, we build on the Shark distributed query engine, which can compute tens of thousands of queries per second. This talk will first introduce BlinkDB; then dive into its integration with Shark and the mathematical foundations behind the project.