Magnet Shuffle Service: Push-based Shuffle at LinkedIn

May 26, 2021 12:05 PM (PT)


The number of daily Apache Spark applications at LinkedIn has increased 3X in the past year. The shuffle process alone, one of the most costly operations in batch computation, processes PBs of data and billions of blocks daily in our clusters. With such a rapid increase in Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workload efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to random reads of small shuffle blocks on HDDs.

To tackle these challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet was accepted at VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically improve shuffle efficiency compared with the existing pull-based shuffle. In addition, by combining push-based and pull-based shuffle, we show how Magnet shuffle service helps harden shuffle infrastructure at LinkedIn scale, both reducing shuffle-related failures and removing scaling bottlenecks. Furthermore, we will share our experiences productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
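For readers who want to try the approach, below is a minimal Scala sketch of enabling push-based shuffle in open-source Apache Spark 3.2+, where a mechanism derived from Magnet was upstreamed. The configuration keys are from upstream Spark, not LinkedIn's internal deployment; the application name and the YARN setup note are illustrative assumptions, and LinkedIn's production tuning is not reflected here.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enabling push-based shuffle in upstream Apache Spark 3.2+ on YARN.
// Values shown are illustrative, not LinkedIn's production settings.
val spark = SparkSession.builder()
  .appName("push-based-shuffle-example") // hypothetical application name
  // Push-based shuffle builds on the external shuffle service.
  .config("spark.shuffle.service.enabled", "true")
  // Client-side switch: map tasks push shuffle blocks to remote shuffle
  // services, which merge them into larger per-partition files that
  // reducers can read sequentially instead of fetching many small blocks.
  .config("spark.shuffle.push.enabled", "true")
  .getOrCreate()

// Server side (configured for the YARN node manager's Spark shuffle service),
// so that pushed blocks are merged rather than ignored:
//   spark.shuffle.push.server.mergedShuffleFileManagerImpl =
//     org.apache.spark.network.shuffle.RemoteBlockPushResolver
```

Reducers still fall back to the original pull-based fetch for any blocks that were not merged in time, which is how the push- and pull-based paths are combined in practice.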

In this session, watch:
Chandni Singh, Senior Software Engineer, LinkedIn
Min Shen, Tech Lead, LinkedIn


Chandni Singh

I have been working on Data Infrastructure projects in the Hadoop ecosystem since 2013. I am an Apache Hadoop committer and an Apache Apex PMC member. Currently, I work on Apache Spark at LinkedIn. I’v...

Min Shen

Min Shen is a tech lead at LinkedIn. His team's focus is to build and scale LinkedIn's general purpose batch compute engine based on Apache Spark. The team empowers multiple use cases at LinkedIn rang...