Cheng first got involved with Spark in late 2013 and joined Databricks in early 2014 as one of the main developers behind Spark SQL. He is now a committer on Apache Spark and Apache Parquet. His current areas of interest include databases and programming languages.
Have you ever hit mysterious random process hangs, performance regressions, or OOM errors that leave barely any useful traces, yet are hard or expensive to reproduce? No matter how tricky the bugs are, they always leave some breadcrumbs along the way. All you need are the skills, tools, and knowledge to trace them. At Databricks, millions of clusters consisting of tens of millions of instances are launched every month to host our customers' workloads. While exciting, this is also a perfect environment for bugs to lurk in, whether in our customers' workloads or in our own platform. This talk shares lessons and case studies from our real-life bug-hunting experiences.