We have developed a workload-aware performance tuning framework for Spark that collects and analyzes telemetry information about all the Spark applications in a cluster. Based on this analysis—which uses batch processing, real-time streaming, and ML analysis that Spark excels at—the framework can identify many ways to improve the overall performance of the Spark workload: (i) by identifying datasets with skewed distributions that are causing significant performance degradation, (ii) by identifying tables and Dataframes that will benefit from caching, (iii) by identifying queries where broadcast joins can be used to improve performance over repartitioned joins, (iv) by identifying the best default values to use at the cluster level for driver and executor container sizes, and (v) by identifying the best cloud machine type for the workload.
Our talk will cover the architecture and algorithms of the framework as well as the effectiveness of the framework in practice. One of the key takeaways from this talk includes how building such a performance framework requires combining algorithms from machine learning with expert knowledge about Spark. We will show through case studies how neither a pure rule-based approach that uses an expert knowledge base nor a pure machine-learning-based approach that uses state-of-the-art algorithms from AI/ML works well in practice. We will also distill our experiences into key insights we learned about building AI/ML applications in Spark.
Session hashtag: #Exp8SAIS
Adrian Popescu is a data engineer at Unravel Data Systems working on performance profiling and optimization of Spark applications. He has more than eight years of experience building and profiling data management applications. He holds a PhD in computer science from EPFL, a master of applied science from the University of Toronto, and a bachelor of science from University Politehnica, Bucharest. His PhD thesis focused on modeling the runtime performance of a class of analytical workloads that include iterative tasks executing on in-memory graph processing engines (Apache Giraph BSP), and SQL-on-Hadoop queries executing at scale on Hive.
Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.