Sameer Agarwal is a Spark Committer and Tech Lead in the Data Platform team at Facebook where he works on building distributed systems and databases that scale across geo-distributed clusters of tens of thousands of machines. Before Facebook, Sameer led the open-source Apache Spark team at Databricks. He received his PhD in Databases from UC Berkeley AMPLab where he worked on BlinkDB, an approximate query engine for Spark.
November 17, 2020 04:00 PM PT
Machine Learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we've added several features in Spark core/SQL to add first class support for Feature Injection and Feature Reaping in Spark. Feature Injection is an important prerequisite to (offline) ML training where the base features are injected/aligned with new/experimental features, with the goal to improve model performance over time. From a query engine's perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table which, if implemented naively, could get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark, wherein IF the data in the base table and the injected feature can be aligned during writes, the join itself can be performed inexpensively.
Feature Reaping is a compute efficient and low latency solution for deleting historical data at sub-partition granularity (i.e., columns or selected map keys), and in order to do it efficiently at our scale, we added a new physical encoding in ORC (called FlatMap) that allowed us to selectively reap/delete specific map keys (features) without performing expensive decoding/encoding and decompression/compression. In this talk, we'll take a deep dive into Spark's optimizer, evaluation engine, data layouts and commit protocols and share how we've implemented these complementary techniques. To this end, we'll discuss several catalyst optimizations to automatically rewrite feature injection/reaping queries as a SQL joins/transforms, describe new ORC physical encodings for storing feature maps, and discuss details of how Spark writes/commits indexed feature tables.
Speakers: Cheng Su and Sameer Agarwal
April 23, 2019 05:00 PM PT
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day.
In this talk, we'll focus on three areas:
-Scaling Compute: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogenous machines in disaggregated (shared-storage) clusters.
-Optimizing Core Engine: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
-Scaling Users: How we make Spark easy to use, and faster to debug to seamlessly onboard new users.
June 5, 2018 05:00 PM PT
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features:
Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements
Session hashtag: #DevSAIS16
December 1, 2013 04:00 PM PT
There is an exponential growth in data that is being collected and stored. This has created an unprecedented demand for processing and analyzing massive amounts of data. Furthermore, analysts and data scientists want results fast to enable explorative data analysis, while more and more applications require data processing to happen in near real time.
In this talk, we present BlinkDB, which uses a radically different approach where queries are always processed in near real time, regardless of the size of the underlying dataset. This is enabled by not looking at all the data, but rather operating on statistical samples of the underlying datasets. More precisely, BlinkDB gives the user the ability to trade between the accuracy of the results and the time it takes to compute queries. The challenge is to ensure that query results are still meaningful, even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that tell the quality of the sampled results. To compute the sampled data in parallel, we build on the Shark distributed query engine, which can compute tens of thousands of queries per second.
This talk will first introduce BlinkDB; then dive into its integration with Shark and the mathematical foundations behind the project.