Shivnath Babu is an Associate Professor of Computer Science at Duke University and the CTO at Unravel Data Systems. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath co-founded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke which has been downloaded by over 100 companies. Shivnath has received a U.S. National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.
We have developed a workload-aware performance tuning framework for Spark that collects and analyzes telemetry information about all the Spark applications in a cluster. Based on this analysis---which uses batch processing, real-time streaming, and ML analysis that Spark excels at---the framework can identify many ways to improve the overall performance of the Spark workload: (i) by identifying datasets with skewed distributions that are causing significant performance degradation, (ii) by identifying tables and Dataframes that will benefit from caching, (iii) by identifying queries where broadcast joins can be used to improve performance over repartitioned joins, (iv) by identifying the best default values to use at the cluster level for driver and executor container sizes, and (v) by identifying the best cloud machine type for the workload. Our talk will cover the architecture and algorithms of the framework as well as the effectiveness of the framework in practice. One of the key takeaways from this talk includes how building such a performance framework requires combining algorithms from machine learning with expert knowledge about Spark. We will show through case studies how neither a pure rule-based approach that uses an expert knowledge base nor a pure machine-learning-based approach that uses state-of-the-art algorithms from AI/ML works well in practice. We will also distill our experiences into key insights we learned about building AI/ML applications in Spark.
Spark's Catalyst Optimizer uses cost-based optimization (CBO) to pick the best execution plan for a SparkSQL query. The CBO can choose which join strategy to use (e.g., a broadcast join versus repartitioned join), which table to use as the build side for the hash-join, which join order to use in a multi-way join query, which filter to push down, and others. To get its decisions right, the CBO makes a number of assumptions including availability of up-to-date statistics about the data, accurate estimation of result sizes, and availability of accurate models to estimate query costs. These assumptions may not hold in real-life settings such as multi-tenant clusters and agile cloud environments; unfortunately, causing the CBO to pick suboptimal execution plans. Dissatisfied users then have to step in and tune the queries manually. In this talk, we will describe how we built Elfino, a Self-Driving Query Optimizer. Elfino tracks each and every query over time---before, during, and after execution---and uses machine learning algorithms to learn from mistakes made by the CBO in estimating properties of the input datasets, intermediate query result sizes, speed of the underlying hardware, and query costs. Elfino can further use an AI algorithm (modeled on multi-armed bandits with expert advice) to "explore and experiment" in a safe and nonintrusive manner using otherwise idle cluster resources. This algorithm enables Elfino to learn about execution plans that the CBO will not consider otherwise. Our talk will cover how these algorithms help guide the CBO towards better plans in real-life settings while reducing its reliance on assumptions and manual steps like query tuning, setting configuration parameters by trial-and-error, and detecting when statistics are stale. We will also share our experiences with evaluating Elfino in multiple environments and highlight interesting avenues for future work.
It can be a frustrating experience for an application developer when her application:(a) fails before completion, (b) does not run quickly or efficiently, or (c) does not produce correct results. There are many reasons why such events happen. For example, Spark’s lazy evaluation, while excellent for performance, can make root-cause diagnosis hard. We are working closely with application developers to make diagnosis, tuning, and debugging of Spark applications easy. Our solution is based on holistic analysis and visualization of profiling information gathered from many points in the Spark stack: the program, the execution graph, counters, data samples from RDDs, time series of metrics exported by various end-points in Spark, YARN, as well as the OS, and others. Through a demo-driven walk-through of failed, slow, and incorrect applications taken from everyday use of Spark, we will show how such a solution can improve the productivity of Spark application developers tremendously.
Allocation and usage of memory in Spark is based on an interplay of algorithms at multiple levels: (i) at the resource-management level across various containers allocated by Mesos or YARN, (ii) at the container level among the OS and multiple processes such as the JVM and Python, (iii) at the Spark application level for caching, aggregation, data shuffles, and program data structures, and (iv) at the JVM level across various pools such as the Young and Old Generation as well as the heap versus off-heap. The goal of this talk is to provide application developers and operational staff easy ways to understand the multitude of choices involved in Spark's memory management. This talk is based on an extensive experimental study of Spark on Yarn that was done using a representative suite of applications. Takeaways from this talk: * We identify the memory pools used at different levels along with the key configuration parameters (i.e., tuning knobs) that control memory management at each level. * We show how to collect resource usage and performance metrics for various memory pools, and how to analyze these metrics to identify contention versus underutilization of the pools. * We show the impact of key memory-pool configuration parameters at the levels of the application, containers, and the JVM. We also highlight tradeoffs in memory usage and running time which are important indicators of resource utilization and application performance. * We demonstrate how application characteristics, such as shuffle selectivity and input data size, dictate the impact of memory pool settings on application response time, efficiency of resource usage, chances of failure, and performance predictability. * We summarize our findings as key troubleshooting and tuning guidelines at each level for improving application performance while achieving the highest resource utilization possible in multi-tenant clusters.