Julien Dumazert - Databricks

Julien Dumazert

CTO and Co-Founder, Data Mechanics

Julien is the CTO and Co-Founder of Data Mechanics, a Y Combinator-backed startup whose mission is to automate the often tedious mechanical work that data engineers perform manually today, starting with Spark performance and stability tuning. He previously worked as a data scientist optimizing BlaBlaCar’s world-leading carpooling marketplace, and led the data team at the website UX optimization platform ContentSquare. He graduated from Ecole Polytechnique and ETH Zurich.


Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Summit 2020

Since initial support was added in Apache Spark 2.3, running Spark on Kubernetes has been growing in popularity. Reasons include the improved isolation and resource sharing of concurrent Spark applications on Kubernetes, as well as the benefit of using a homogeneous, cloud-native infrastructure for a company's entire tech stack. But running Spark on Kubernetes in a stable, performant, cost-efficient and secure manner also presents specific challenges. In this talk, JY and Julien will go over lessons learned while building Data Mechanics, a serverless Spark platform powered by Kubernetes. Topics include:

  • Scalability bottlenecks of Spark on Kubernetes
  • Optimizations for highly concurrent interactive use cases
  • Specificities of data I/O on Kubernetes
  • Secure access to data via Kubernetes role-based access control
  • Automated job configuration tuning


How to Automate Performance Tuning for Apache Spark
Summit Europe 2019

Spark has made writing big data pipelines much easier than before. But a lot of effort is required to keep data pipelines performant and stable in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How can I make sure that my pipeline always finishes on time and meets its SLA?

These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful pieces of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or by third parties such as the Data Mechanics platform.