Will is a senior architect at Palantir and the global head of its compute platform. Over the past 5 years, he has led or been extensively involved in Palantir’s work on Apache Spark, as well as its migrations to Parquet, to Spark on YARN, and most recently to Spark on Kubernetes. Most of his recent work has been focused on optimizing user-perceived performance via improvements to scheduling and autoscaling.
Kubernetes is an open-source containerization framework that makes it easy to manage applications in isolated environments at scale. In Apache Spark 2.3, Spark introduced support for native integration with Kubernetes. Palantir has been deeply involved with the development of Spark's Kubernetes integration from the beginning, and our largest production deployment now runs an average of ~5 million Spark pods per day, as part of tens of thousands of Spark applications.
Over the course of our adventures in migrating deployments from YARN to Kubernetes, we have overcome a number of performance, cost, & reliability hurdles: differences in shuffle performance due to smaller filesystem caches in containers; Kubernetes CPU limits causing inadvertent throttling of containers that run many Java threads; and lack of support for dynamic allocation leading to resource wastage. We intend to briefly describe our story of developing & deploying Spark-on-Kubernetes, as well as lessons learned from deploying containerized Spark applications in production.
We will also describe our recently open-sourced extension (https://github.com/palantir/k8s-spark-scheduler) to the Kubernetes scheduler to better support Spark workloads & facilitate Spark-aware cluster autoscaling; our limited implementation of dynamic allocation on Kubernetes; and ongoing work that is required to support dynamic resource management & stable performance at scale (i.e., our work with the community on a pluggable external shuffle service API). Our hope is that our lessons learned and ongoing work will help other community members who want to use Spark on Kubernetes for their own workloads.