Chendi Xue is a software engineer from Intel data analytics team. She has six years’ experience in bigdata and cloud system optimization, focusing on computation, storage, network software stack performance analysis and optimization. She participated in the development works including Spark-Sql, Spark-Shuffle optimization, cache implementation, etc. Before that, she worked on Linux Device-Mapper optimization and iscsi optimization during her master degree study.a
The increasing challenge to serve ever-growing data driven by AI and analytics workloads makes disaggregated storage and compute more attractive as it enables companies to scale their storage and compute capacity independently to match data & compute growth rate. Cloud based big data services is gaining momentum as it provides simplified management, elasticity, and pay-as-you-go model. However, Spark shuffle brings performance, scalability and reliability issues in the disaggregated architecture. Shuffle is an I/O intensive operation, which will lead to performance issues if using a typical cloud provisioned volume as shuffle media. Meanwhile, the shuffle operation of different tasks may interfere with each other thus limits Spark's scalability. Moreover, shuffle re-computation in case of compute node failure poses significant overhead for long running jobs. Disaggregating shuffle from compute node is becoming more and more important for cloud-native Spark application.
In this session, we propose a new fully disaggregated shuffle solution that leverage state-of-art hardware technologies including persistent memory and RDMA. It includes: a new pluggable shuffle manager, a persistent memory based distributed storage system, a RDMA powered network library and an innovative approach to use persistent memory as both shuffle media as well as RDMA memory region to reduce additional memory copies and context switch. This new remote shuffle solution improved Spark scalability by disaggregating Spark shuffle from compute node to a high-performance distributed storage, improved spark shuffle performance with high speed persistent memory and low latency RDMA network, and improved reliability by providing shuffle data replication and fault-tolerant optimization. Experiment performance numbers will be also presented, which demonstrates up to 10x performance speedup over traditional shuffle solution and three orders of magnitude reduction in terms of shuffle read block time.