Have you ever hit mysterious, random process hangs, performance regressions, or OOM errors that leave barely any useful traces and are hard or expensive to reproduce? No matter how tricky the bugs are, they always leave some breadcrumbs along the way. All you need are the skills, tools, and knowledge to trace them. At Databricks, millions of clusters consisting of tens of millions of instances are launched every month to host our customers’ workloads. While exciting, this is also a perfect environment for bugs to lurk, whether in our customers’ workloads or in our own platform. This talk shares lessons and case studies from our real-life bug-hunting experiences.
Cheng is a PMC member and committer of Apache Spark and a senior engineer at Databricks. He has been working on Apache Spark since 0.7.8 and is one of the major contributors to Spark SQL.
Kris Mok is a software engineer at Databricks. He works on various components of Spark SQL, with an interest in the optimizer and code generation. Previously, he worked on JVM implementations, including the OpenJDK HotSpot VM at Alibaba and Oracle and the Zing VM at Azul, and he has a broad interest in programming language design and implementation.