Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production

With more than 700 million monthly active users, Instagram continues to make it easier for people across the globe to join the community, share their experiences, and strengthen connections to their friends and passions. Powering Instagram’s various products requires machine learning, high-performance ranking services, and, most importantly, large amounts of data. At Instagram, we use Apache Spark for several critical production pipelines, including generating labeled training data for our machine learning models. In this session, you’ll learn how one of Instagram’s largest Spark pipelines has evolved over time to process ~300 TB of input and ~90 TB of shuffle data. We’ll discuss our experience building and managing such a large production pipeline, along with tips and tricks we’ve learned for managing Spark at scale. Topics include migrating from RDD to Dataset for better memory efficiency, splitting up long-running pipelines to better tune intermediate shuffle data, and dealing with data skew that changes over time. Finally, we’ll go over some optimizations we have made to maintain the reliability of this critical data pipeline.
Session hashtag: #EUde0

About Brandon Carl

Brandon is a Software Engineer on the Instagram Engagement Infrastructure team at Facebook. His team focuses on providing the data and services necessary for high-performance ranking of various products within Instagram. Prior to Facebook, he worked on streaming data systems at Quantcast and backend web services at Apple. Brandon holds a B.S. in Computer Science from the University of California, San Diego.