Brian has been working on Spark at Facebook for over an year. He is excited to be a part of growing and scaling Spark at Facebook. Previously, he was a researcher at Seoul National University and software engineer at Samsung. He has a PhD from UIUC in distributed systems.
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
At Facebook, the hardware trend is moving to disaggregated storage and compute architecture in order to improve cost efficiency and scale at finer granular. For Spark, this means that compute machines store and access temporary data (shuffle, spill, persisted RDD/dataframes) on storage machines over the network. Disaggregated storage has many advantages over collocated storage where each node is constrained to the space of its local disks. Over half of our Spark workload is now running on disaggregated infrastructure with individual jobs shuffling more than 300 TB compressed. In this talk, we dive into changes made to Spark to take advantage of Facebook's disaggregated architecture and lessons learned from scaling out and managing these large production clusters. Session hashtag: #Dev7SAIS
Data shuffling is a costly operation. At Facebook, single job shuffles can reach the scale of over 300TB compressed using (relatively cheap) large spinning disks. However, shuffle reads issue large amounts of inefficient, small, random I/O requests to disks and can be a large source of job latency as well as waste of reserved system resources. In order to boost shuffle performance and improve resource efficiency, we have developed Spark-optimized Shuffle (SOS). This shuffle technique effectively converts a large number of small shuffle read requests into fewer large, sequential I/O requests. In this session, we present SOS's multi-stage shuffle architecture and implementation. We will also share our production results and future optimizations. Session hashtag: #Dev5SAIS