Mayank Bansal

Staff Engineer, Uber Inc

Mayank Bansal is a Staff Engineer on the data infrastructure team at Uber. He is a co-author of Peloton, an Apache Hadoop committer, and an Oozie PMC member and committer. Previously, he worked on the Hadoop platform team at eBay, leading the YARN and MapReduce effort. Before that, he worked at Yahoo on Oozie.

Past sessions

Summit 2021 Tale of Scaling Zeus to Petabytes of Shuffle Data @Uber

May 27, 2021 12:10 PM PT

Zeus is an efficient, highly scalable, distributed shuffle-as-a-service that powers all data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in the industry, which leads to many issues such as hardware failures (burned-out disks) and reliability and scalability challenges. Last year, we presented the Zeus service architecture and early results at this forum. Since then we have made great progress: we open-sourced Zeus last year and deployed it to all of our analytics clusters.

In this talk, we describe how we scaled the Zeus service to all Spark workloads, reaching billions of shuffle messages and petabytes of shuffle data at Uber. We also cover the strategies we used to roll out Zeus at this scale without users noticing any difference or experiencing any service disruption. Finally, we discuss the improvements on the horizon for Zeus, as well as the performance and reliability improvements made in later releases.

In this session watch:
Mayank Bansal, Staff Engineer, Uber Inc


Zeus is an efficient, highly scalable, distributed shuffle-as-a-service that powers all data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in the industry, which leads to many issues such as hardware failures (burned-out disks) and reliability and scalability challenges. Zeus is built from the ground up to support hundreds of thousands of jobs and millions of containers that shuffle petabytes of data. Zeus changes the paradigm of the existing external shuffle, resulting in far better shuffle performance. Although shuffle data is written remotely, performance is the same or better for most jobs. In this talk we'll take a deep dive into the Zeus architecture and describe how it's deployed at Uber. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism. We will also compare Zeus performance numbers with those of external shuffle backed by different storage systems, e.g. NFS and HDFS. Finally, we will talk about the future roadmap and plans for Zeus.