Managing Millions of Tests Using Databricks

Databricks Runtime is the execution environment that powers millions of VMs running data engineering and machine learning workloads daily in Databricks. Inside Databricks, we run millions of tests per day to ensure the quality of different versions of Databricks Runtime. Due to the large number of tests executed daily, we have been continuously facing the challenge of effective test result monitoring and problem triaging. In this talk, I am going to share our experience of building the automated test monitoring and reporting system using Databricks. I will cover how we ingest data from different data sources like CI systems and Bazel build metadata to Delta, and how we analyze test results and report failures to their owners through Jira. I will also show you how this system empowers us to build different types of reports that effectively track the quality of changes made to Databricks Runtime.

About Yin Huai

Yin is a Staff Software Engineer at Databricks. His work focuses on designing and building Databricks Runtime container environment, and its associated testing and release infrastructures. Before joining Databricks, he was a PhD student at The Ohio State University and was advised by Xiaodong Zhang. Yin is also an Apache Spark PMC member.