Shivnath Babu

CTO/Adjunct Professor, Unravel Data Systems/Duke University

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Past sessions

Summit 2021 Jeeves Grows Up: An AI Chatbot for Performance and Quality

May 27, 2021 03:15 PM PT

Sarah: CEO-Finance-Report pipeline seems to be slow today. Why
Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. Issue has not been seen in last 90 days.
Jeeves: Adding 5 more nodes to cluster recommended for CEO-Finance-Report to finish in its 99th percentile time of 5.2 hours.
Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert.
We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now "knows" about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example:
Tom: I am not seeing any data for today in my Campaign Metrics Dashboard.
Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. Run of pipeline cmp_incremental_daily failed on 2/28/2021.
This talk will give an overview of the newer capabilities of the chatbot, and how it now fits in a modern data stack with the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.

In this session watch:
Shivnath Babu, CTO/Adjunct Professor, Unravel Data Systems/Duke University


Sarah: My Spark SQL query failed. How can I fix it?

Jeeves: Your Spark query driver went out of memory.

Jeeves: You can set spark.driver.memory to 2.2GB and rerun the query to complete it successfully.

Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of performance problems quickly. Instead of just being stuck to screens displaying performance logs and metrics, users can now have more refreshing experience; and consume performance insights via a two-way conversation with their own personal Spark expert.

This talk will give an overview of the chatbot, its architecture, and how it fits in a complex Spark environment. The chatbot connects to a large number of sources to get the data to power its AI algorithms. It can detect anomalies in performance and push key insights via alerts to users when they need them the most. The chatbot can also be told to take actions like creating tickets and making configuration changes. You will learn how to build chatbots that tackle your complex data operations challenges with AI algorithms and automation, keeping a cool head at all times.

Summit 2018 Using Apache Spark to Tune Spark

June 4, 2018 05:00 PM PT

We have developed a workload-aware performance tuning framework for Spark that collects and analyzes telemetry information about all the Spark applications in a cluster. Based on this analysis---which uses batch processing, real-time streaming, and ML analysis that Spark excels at---the framework can identify many ways to improve the overall performance of the Spark workload: (i) by identifying datasets with skewed distributions that are causing significant performance degradation, (ii) by identifying tables and Dataframes that will benefit from caching, (iii) by identifying queries where broadcast joins can be used to improve performance over repartitioned joins, (iv) by identifying the best default values to use at the cluster level for driver and executor container sizes, and (v) by identifying the best cloud machine type for the workload.

Our talk will cover the architecture and algorithms of the framework as well as the effectiveness of the framework in practice. One of the key takeaways from this talk includes how building such a performance framework requires combining algorithms from machine learning with expert knowledge about Spark. We will show through case studies how neither a pure rule-based approach that uses an expert knowledge base nor a pure machine-learning-based approach that uses state-of-the-art algorithms from AI/ML works well in practice. We will also distill our experiences into key insights we learned about building AI/ML applications in Spark.

Session hashtag: #Exp8SAIS

Summit 2018 Using AI to Build a Self-Driving Query Optimizer

June 4, 2018 05:00 PM PT

Spark's Catalyst Optimizer uses cost-based optimization (CBO) to pick the best execution plan for a SparkSQL query. The CBO can choose which join strategy to use (e.g., a broadcast join versus repartitioned join), which table to use as the build side for the hash-join, which join order to use in a multi-way join query, which filter to push down, and others. To get its decisions right, the CBO makes a number of assumptions including availability of up-to-date statistics about the data, accurate estimation of result sizes, and availability of accurate models to estimate query costs.

These assumptions may not hold in real-life settings such as multi-tenant clusters and agile cloud environments; unfortunately, causing the CBO to pick suboptimal execution plans. Dissatisfied users then have to step in and tune the queries manually. In this talk, we will describe how we built Elfino, a Self-Driving Query Optimizer. Elfino tracks each and every query over time---before, during, and after execution---and uses machine learning algorithms to learn from mistakes made by the CBO in estimating properties of the input datasets, intermediate query result sizes, speed of the underlying hardware, and query costs. Elfino can further use an AI algorithm (modeled on multi-armed bandits with expert advice) to "explore and experiment" in a safe and nonintrusive manner using otherwise idle cluster resources. This algorithm enables Elfino to learn about execution plans that the CBO will not consider otherwise.

Our talk will cover how these algorithms help guide the CBO towards better plans in real-life settings while reducing its reliance on assumptions and manual steps like query tuning, setting configuration parameters by trial-and-error, and detecting when statistics are stale. We will also share our experiences with evaluating Elfino in multiple environments and highlight interesting avenues for future work.

Session hashtag: #AI7SAIS

It can be a frustrating experience for an application developer when her application:(a) fails before completion,
(b) does not run quickly or efficiently, or
(c) does not produce correct results.
There are many reasons why such events happen. For example, Spark’s lazy evaluation, while excellent for performance, can make root-cause diagnosis hard. We are working closely with application developers to make diagnosis, tuning, and debugging of Spark applications easy. Our solution is based on holistic analysis and visualization of profiling information gathered from many points in the Spark stack: the program, the execution graph, counters, data samples from RDDs, time series of metrics exported by various end-points in Spark, YARN, as well as the OS, and others. Through a demo-driven walk-through of failed, slow, and incorrect applications taken from everyday use of Spark, we will show how such a solution can improve the productivity of Spark application developers tremendously.

Allocation and usage of memory in Spark is based on an interplay of algorithms at multiple levels: (i) at the resource-management level across various containers allocated by Mesos or YARN, (ii) at the container level among the OS and multiple processes such as the JVM and Python, (iii) at the Spark application level for caching, aggregation, data shuffles, and program data structures, and (iv) at the JVM level across various pools such as the Young and Old Generation as well as the heap versus off-heap. The goal of this talk is to provide application developers and operational staff easy ways to understand the multitude of choices involved in Spark's memory management. This talk is based on an extensive experimental study of Spark on Yarn that was done using a representative suite of applications.

Takeaways from this talk:

- We identify the memory pools used at different levels along with the key configuration parameters (i.e., tuning knobs) that control memory management at each level.
- We show how to collect resource usage and performance metrics for various memory pools, and how to analyze these metrics to identify contention versus underutilization of the pools.
- We show the impact of key memory-pool configuration parameters at the levels of the application, containers, and the JVM. We also highlight tradeoffs in memory usage and running time which are important indicators of resource utilization and application performance.
- We demonstrate how application characteristics, such as shuffle selectivity and input data size, dictate the impact of memory pool settings on application response time, efficiency of resource usage, chances of failure, and performance predictability.
- We summarize our findings as key troubleshooting and tuning guidelines at each level for improving application performance while achieving the highest resource utilization possible in multi-tenant clusters.

Learn more:

  • Deep Dive: Apache Spark Memory Management
  • Spark on YARN: a Deep Dive
  • Spark-­on-­YARN: The Road Ahead