Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How can I make sure that my pipeline always finishes on time and meets its SLA?
These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful pieces of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or third parties like the Data Mechanics platform.
JY is the CEO and Co-Founder of Data Mechanics, a hassle-free containerized data platform that abstracts away the complexities of Spark and infrastructure management. Prior to that, he was a software engineer and Spark infrastructure team lead at Databricks, growing their cluster-management capabilities from the early days to the scale of launching hundreds of thousands of nodes in the cloud every day. JY is passionate about making distributed data technologies 10x more accessible and resource-efficient through automation.
Julien is the CTO and Co-Founder of Data Mechanics, a Y Combinator-backed startup with the mission to make distributed computing accessible to everyone, starting with a serverless Spark platform running on Kubernetes. He previously worked as a data scientist on optimizing BlaBlaCar’s world-leading carpooling marketplace, and led the data team at the website UX optimization platform ContentSquare.