Min Shen is a tech lead at LinkedIn. His team’s focus is to build and scale LinkedIn’s general-purpose batch compute engine based on Apache Spark. The team supports multiple use cases at LinkedIn, ranging from data exploration and data engineering to ML model training. Prior to this, Min mainly worked on Apache YARN. He holds a PhD in Computer Science from the University of Illinois at Chicago.
Over the past 3 years, Apache Spark has transitioned from an experiment to the dominant production compute engine at LinkedIn. Within the past year, we have seen a 3x growth in daily Spark applications. Today, it powers many use cases ranging from AI to data engineering to analytics. More than 1,000 active Spark users launch tens of thousands of Spark jobs on our clusters, processing petabytes of data on a daily basis. Throughout this journey, we have faced multiple challenges in scaling our Spark compute infrastructure and in enabling our fast-growing user base to develop working and efficient Spark applications: removing the major infrastructure scaling bottlenecks by optimizing core Spark components such as shuffle and the Spark History Server; balancing limited compute resources against users' ever-increasing compute demands by improving the cluster resource scheduler; improving users' development productivity without falling deep into the 'support trap' by automating root cause analysis of job failures; and boosting the efficiency of users' Spark jobs without the repeated tuning of jobs that hinders their development agility. In this talk, we will share the work we have done to tackle these challenges and what we have learned along the way.