Jian Zhang is a senior software engineering manager at Intel, where he and his team primarily focus on open source big data analytics software development and optimization, and build reference solutions for customers. He has over 10 years of experience in performance analysis and optimization across many open source projects, including Xen, KVM, Swift, Ceph, Spark, and Hadoop, as well as benchmarking workloads such as those from SPEC and TPC. He earned a master's degree in Computer Science and Engineering at Shanghai Jiao Tong University.
The increasing challenge of serving ever-growing data driven by AI and analytics workloads makes disaggregating storage and compute more attractive, as it enables companies to scale their storage and compute capacity independently to match data and compute growth rates. Cloud-based big data services are gaining momentum because they provide simplified management, elasticity, and a pay-as-you-go model. However, Spark shuffle brings performance, scalability, and reliability issues in a disaggregated architecture. Shuffle is an I/O-intensive operation, which leads to performance problems when a typical cloud-provisioned volume is used as the shuffle media. Meanwhile, the shuffle operations of different tasks may interfere with each other, limiting Spark's scalability. Moreover, shuffle re-computation after a compute node failure poses significant overhead for long-running jobs. Disaggregating shuffle from the compute nodes is therefore becoming increasingly important for cloud-native Spark applications.
In this session, we propose a new fully disaggregated shuffle solution that leverages state-of-the-art hardware technologies, including persistent memory and RDMA. It includes a new pluggable shuffle manager, a persistent-memory-based distributed storage system, an RDMA-powered network library, and an innovative approach that uses persistent memory as both the shuffle media and the RDMA memory region, eliminating additional memory copies and context switches. This remote shuffle solution improves Spark's scalability by disaggregating shuffle from the compute nodes onto a high-performance distributed storage system, improves Spark shuffle performance with high-speed persistent memory and a low-latency RDMA network, and improves reliability by providing shuffle data replication and fault-tolerance optimizations. Experimental performance numbers will also be presented, demonstrating up to a 10x speedup over the traditional shuffle solution and a three-orders-of-magnitude reduction in shuffle read block time.
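Spark's shuffle implementation is selected through the `spark.shuffle.manager` configuration property, which is what makes a pluggable remote shuffle manager like the one described above possible without application changes. The sketch below is illustrative only: the class name and jar path are hypothetical placeholders, not the actual artifacts from this project.

```shell
# Minimal sketch of enabling a custom (remote) shuffle manager at submit
# time. spark.shuffle.manager is a real Spark configuration key; the
# RemoteShuffleManager class name and remote-shuffle.jar below are
# hypothetical placeholders for a disaggregated shuffle plugin.
spark-submit \
  --class com.example.MyApp \
  --conf spark.shuffle.manager=org.example.shuffle.RemoteShuffleManager \
  --jars remote-shuffle.jar \
  my-app.jar
```

Because the shuffle manager is resolved from configuration at startup, the same application jar can run unmodified against either the default local shuffle or a disaggregated backend.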