Bo is a Sr. Software Engineer II at Uber, working on the Spark team. In the past, he worked on many streaming technologies.
June 24, 2020 05:00 PM PT
Zeus is an efficient, highly scalable, distributed shuffle-as-a-service that powers all data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on YARN in the industry, which leads to many issues, such as hardware failures (burned-out disks) and reliability and scalability challenges. Zeus is built from the ground up to support hundreds of thousands of jobs and millions of containers, which together shuffle petabytes of data. Zeus changes the paradigm of the current external shuffle, resulting in far better shuffle performance. Although shuffle data is written remotely, performance is the same or better for most jobs. In this talk, we'll take a deep dive into the Zeus architecture and describe how it's deployed at Uber. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism. We will also compare Zeus performance numbers with those of external shuffle backed by different storage systems, e.g. NFS and HDFS. Finally, we will talk about the future roadmap and plans for Zeus.