SOS: Optimizing Shuffle I/O - Databricks

SOS: Optimizing Shuffle I/O

Download Slides

Data shuffling is a costly operation. At Facebook, single job shuffles can reach the scale of over 300TB compressed using (relatively cheap) large spinning disks. However, shuffle reads issue large amounts of inefficient, small, random I/O requests to disks and can be a large source of job latency as well as waste of reserved system resources. In order to boost shuffle performance and improve resource efficiency, we have developed Spark-optimized Shuffle (SOS). This shuffle technique effectively converts a large number of small shuffle read requests into fewer large, sequential I/O requests.

In this session, we present SOS’s multi-stage shuffle architecture and implementation. We will also share our production results and future optimizations.

Session hashtag: #Dev5SAIS

About Brian Cho

Brian has been working on Spark at Facebook for over an year. He is excited to be a part of growing and scaling Spark at Facebook. Previously, he was a researcher at Seoul National University and software engineer at Samsung. He has a PhD from UIUC in distributed systems.

About Ergin Seyfe

Ergin is a software engineer on BigCompute team at Facebook, where he is working on scaling Spark at Facebook. Previously, he worked on Distributed Data team at Box and focused on caching/storage and prior to that he worked at Microsoft on various teams including System Center, SQL Azure and Bing Ads.