Tracing the Breadcrumbs: Apache Spark Workload Diagnostics

Have you ever hit mysterious random process hangs, performance regressions, or OOM errors that leave barely any useful traces, yet are hard or expensive to reproduce? No matter how tricky the bugs are, they always leave some breadcrumbs along the way. All you need are the skills, tools, and knowledge to trace them. At Databricks, millions of clusters consisting of tens of millions of instances are launched every month to host our customers’ workloads. While exciting, this is also a perfect environment for bugs to lurk in, whether in our customers’ workloads or in our own platform. This talk shares lessons and case studies from our real-life bug-hunting experiences.

About Cheng Lian


Cheng first encountered Spark in late 2013 and joined Databricks in early 2014 as one of the main developers behind Spark SQL. He is now a committer of Apache Spark and Apache Parquet. His current areas of interest include databases and programming languages.

About Kris Mok


Kris Mok is a software engineer at Databricks. He works on various components of Spark SQL, with an interest in the optimizer and code generation. Previously, he worked on JVM implementations, including the OpenJDK HotSpot VM at Alibaba and Oracle and the Zing VM at Azul, and he has a broad interest in programming language design and implementation.