Ergin is a software engineer on BigCompute team at Facebook, where he is working on scaling Spark at Facebook. Previously, he worked on Distributed Data team at Box and focused on caching/storage and prior to that he worked at Microsoft on various teams including System Center, SQL Azure and Bing Ads.
At Facebook, the hardware trend is moving to disaggregated storage and compute architecture in order to improve cost efficiency and scale at finer granular. For Spark, this means that compute machines store and access temporary data (shuffle, spill, persisted RDD/dataframes) on storage machines over the network. Disaggregated storage has many advantages over collocated storage where each node is constrained to the space of its local disks. Over half of our Spark workload is now running on disaggregated infrastructure with individual jobs shuffling more than 300 TB compressed. In this talk, we dive into changes made to Spark to take advantage of Facebook's disaggregated architecture and lessons learned from scaling out and managing these large production clusters. Session hashtag: #Dev7SAIS
Data shuffling is a costly operation. At Facebook, single job shuffles can reach the scale of over 300TB compressed using (relatively cheap) large spinning disks. However, shuffle reads issue large amounts of inefficient, small, random I/O requests to disks and can be a large source of job latency as well as waste of reserved system resources. In order to boost shuffle performance and improve resource efficiency, we have developed Spark-optimized Shuffle (SOS). This shuffle technique effectively converts a large number of small shuffle read requests into fewer large, sequential I/O requests. In this session, we present SOS's multi-stage shuffle architecture and implementation. We will also share our production results and future optimizations. Session hashtag: #Dev5SAIS