Zoe is a software engineer on LinkedIn’s Spark team, where she supports Spark use cases at LinkedIn and tackles various platform challenges, focusing mostly on Spark tracking, metrics, and tuning. Previously, she studied at UC Berkeley and Carnegie Mellon University.
Over the past 3 years, Apache Spark has transitioned from an experiment to the dominant production compute engine at LinkedIn. Within the past year, we have seen 3x growth in daily Spark applications. Today, it powers many use cases ranging from AI to data engineering to analytics. More than 1,000 active Spark users launch tens of thousands of Spark jobs on our clusters, processing petabytes of data on a daily basis. Throughout this journey, we have faced multiple challenges in scaling our Spark compute infrastructure and empowering our fast-growing user base to develop working and efficient Spark applications: (1) removing the major infrastructure scaling bottlenecks by optimizing core Spark components such as shuffle and the Spark History Server; (2) balancing limited compute resources against users' ever-increasing compute demands by improving the cluster resource scheduler; (3) improving users' development productivity without falling deep into the "support trap" by automating job failure root cause analysis; and (4) boosting the efficiency of users' Spark jobs without the loss of development agility that comes with repeated tuning of the jobs. In this talk, we will share the work we have done to tackle these challenges and what we have learnt in the process.