Running Apache Spark on Kubernetes: Best Practices and Pitfalls

Since initial support was added in Apache Spark 2.3, running Spark on Kubernetes has been growing in popularity. Reasons include the improved isolation and resource sharing of concurrent Spark applications on Kubernetes, as well as the benefit of using a homogeneous, cloud-native infrastructure for a company's entire tech stack. But running Spark on Kubernetes in a stable, performant, cost-efficient, and secure manner also presents specific challenges. In this talk, JY and Julien will go over lessons learned while building Data Mechanics, a serverless Spark platform powered by Kubernetes.

Topics include:

  • Core concepts and setup of Spark on Kubernetes
  • Configuration tips for performance and efficient resource sharing
  • Application-level dynamic allocation in Spark and cluster-level autoscaling
  • Kubernetes-specific considerations for data I/O performance
  • Monitoring and security best practices
  • Limitations and planned future work

About Jean-Yves Stephan

Data Mechanics

JY is the CEO and Co-Founder of Data Mechanics, a hassle-free containerized data platform that abstracts away the complexities of Spark and infrastructure management. Prior to that, he was a software engineer and Spark infrastructure team lead at Databricks, where he grew the company's cluster-management capabilities from the early days to the scale of launching hundreds of thousands of nodes in the cloud every day. JY is passionate about making distributed data technologies 10x more accessible and resource-efficient through automation.

About Julien Dumazert

Data Mechanics

Julien is the CTO and Co-Founder of Data Mechanics, a Y Combinator-backed startup with the mission to make distributed computing accessible to everyone, starting with a serverless Spark platform running on Kubernetes. He previously worked as a data scientist on optimizing BlaBlaCar's world-leading carpooling marketplace, and led the data team at the website UX optimization platform ContentSquare.