Adrian Popescu is a data engineer at Unravel Data Systems working on performance profiling and optimization of Spark applications. He has more than eight years of experience building and profiling data management applications. He holds a PhD in computer science from EPFL, a master of applied science from the University of Toronto, and a bachelor of science from University Politehnica, Bucharest. His PhD thesis focused on modeling the runtime performance of a class of analytical workloads that include iterative tasks executing on in-memory graph processing engines (Apache Giraph BSP), and SQL-on-Hadoop queries executing at scale on Hive.
June 4, 2018 05:00 PM PT
We have developed a workload-aware performance tuning framework for Spark that collects and analyzes telemetry information about all the Spark applications in a cluster. Based on this analysis---which uses batch processing, real-time streaming, and ML analysis that Spark excels at---the framework can identify many ways to improve the overall performance of the Spark workload: (i) by identifying datasets with skewed distributions that are causing significant performance degradation, (ii) by identifying tables and Dataframes that will benefit from caching, (iii) by identifying queries where broadcast joins can be used to improve performance over repartitioned joins, (iv) by identifying the best default values to use at the cluster level for driver and executor container sizes, and (v) by identifying the best cloud machine type for the workload.
Our talk will cover the architecture and algorithms of the framework as well as the effectiveness of the framework in practice. One of the key takeaways from this talk includes how building such a performance framework requires combining algorithms from machine learning with expert knowledge about Spark. We will show through case studies how neither a pure rule-based approach that uses an expert knowledge base nor a pure machine-learning-based approach that uses state-of-the-art algorithms from AI/ML works well in practice. We will also distill our experiences into key insights we learned about building AI/ML applications in Spark.
Session hashtag: #Exp8SAIS
June 4, 2018 05:00 PM PT
Spark's Catalyst Optimizer uses cost-based optimization (CBO) to pick the best execution plan for a SparkSQL query. The CBO can choose which join strategy to use (e.g., a broadcast join versus repartitioned join), which table to use as the build side for the hash-join, which join order to use in a multi-way join query, which filter to push down, and others. To get its decisions right, the CBO makes a number of assumptions including availability of up-to-date statistics about the data, accurate estimation of result sizes, and availability of accurate models to estimate query costs.
These assumptions may not hold in real-life settings such as multi-tenant clusters and agile cloud environments; unfortunately, causing the CBO to pick suboptimal execution plans. Dissatisfied users then have to step in and tune the queries manually. In this talk, we will describe how we built Elfino, a Self-Driving Query Optimizer. Elfino tracks each and every query over time---before, during, and after execution---and uses machine learning algorithms to learn from mistakes made by the CBO in estimating properties of the input datasets, intermediate query result sizes, speed of the underlying hardware, and query costs. Elfino can further use an AI algorithm (modeled on multi-armed bandits with expert advice) to "explore and experiment" in a safe and nonintrusive manner using otherwise idle cluster resources. This algorithm enables Elfino to learn about execution plans that the CBO will not consider otherwise.
Our talk will cover how these algorithms help guide the CBO towards better plans in real-life settings while reducing its reliance on assumptions and manual steps like query tuning, setting configuration parameters by trial-and-error, and detecting when statistics are stale. We will also share our experiences with evaluating Elfino in multiple environments and highlight interesting avenues for future work.
Session hashtag: #AI7SAIS