At Facebook, the hardware trend is moving to a disaggregated storage and compute architecture in order to improve cost efficiency and scale at a finer granularity. For Spark, this means that compute machines store and access temporary data (shuffle, spill, persisted RDDs/DataFrames) on storage machines over the network. Disaggregated storage has many advantages over co-located storage, where each node is constrained to the capacity of its local disks. Over half of our Spark workload is now running on disaggregated infrastructure, with individual jobs shuffling more than 300 TB of compressed data.
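To make the idea concrete, here is a minimal sketch of what redirecting Spark's temporary data to remote storage could look like in `spark-defaults.conf`. The mount point `/mnt/warm-storage` is a hypothetical path for illustration; the property names are standard open-source Spark settings, not Facebook's internal shuffle service, which this talk covers.

```
# Hypothetical sketch: point Spark's temporary-data directories
# (shuffle files, spill) at a network-mounted storage volume.
# /mnt/warm-storage is a made-up mount point, not Facebook's setup.
spark.local.dir                  /mnt/warm-storage/spark-tmp

# Run the external shuffle service so shuffle files remain
# servable even after the executor that wrote them exits.
spark.shuffle.service.enabled    true
```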
In this talk, we dive into changes made to Spark to take advantage of Facebook’s disaggregated architecture and lessons learned from scaling out and managing these large production clusters.
Session hashtag: #Dev7SAIS
Brian has been working on Spark at Facebook for over a year. He is excited to be a part of growing and scaling Spark at Facebook. Previously, he was a researcher at Seoul National University and a software engineer at Samsung. He holds a PhD in distributed systems from UIUC.
Ergin is a software engineer on the BigCompute team at Facebook, where he works on scaling Spark. Previously, he was on the Distributed Data team at Box, focusing on caching and storage, and before that he worked at Microsoft on various teams, including System Center, SQL Azure, and Bing Ads.