Managing Millions of Tests Using Databricks

May 26, 2021 05:00 PM (PT)


Databricks Runtime is the execution environment that powers millions of VMs running data engineering and machine learning workloads daily in Databricks. Inside Databricks, we run millions of tests per day to ensure the quality of different versions of Databricks Runtime. Due to the large number of tests executed daily, we have been continuously facing the challenge of effective test result monitoring and problem triaging. In this talk, I am going to share our experience of building the automated test monitoring and reporting system using Databricks. I will cover how we ingest data from different data sources like CI systems and Bazel build metadata to Delta, and how we analyze test results and report failures to their owners through Jira. I will also show you how this system empowers us to build different types of reports that effectively track the quality of changes made to Databricks Runtime.

In this session watch:
Yin Huai, Software Engineer, Databricks

 

Transcript

Yin Huai: Hello everyone, welcome to my talk. Today, I’m going to talk about how we use Databricks to manage millions of tests inside Databricks. First, let me introduce myself. My name is Yin Huai. I’m a staff software engineer at Databricks, and I work on Databricks Runtime. During my early days at Databricks, my work was primarily on Spark SQL. In the past three years, I have been focusing on designing and building the Databricks Runtime container environment and its associated testing and release infrastructures.
Databricks operates a global-scale, multi-cloud data platform. Our platform control plane is deployed to different cloud regions in order to support our thousands of customers around the world. At Databricks, we maintain a deep technical stack with various systems to power our data platform and to keep it evolving at a fast pace. To scale efficiently, both in terms of providing more value to our customers and handling the increasing load, we have to keep investing in the factory that builds our data platform. In this factory, we built out a cloud abstraction layer for key integrations with cloud providers. This layer, combined with Kubernetes and Docker, provides a cloud-agnostic environment for developing and operating the services powering our data platform. There is also a set of foundational services. They provide common functionality like monitoring, access control, and billing to other services in our platform. Then at the top of our stack, there is a layer of services that handle various operations issued by our customers. And there is Databricks Runtime, which provides standardized computing environments to all clusters powering customer workloads.
Besides having a deep technical stack, the Databricks unified data platform has a wide surface area for our customers. With Databricks Runtime, customers build different kinds of workloads using SQL, Python, Scala, and R. The entire Databricks Runtime cluster is presented to our customers. They can build workloads ranging from accessing functionality provided by the operating system to just using high-level declarative languages like SQL. Also, with services like the job scheduler, SQL Analytics, and MLflow, Databricks allows different types of customers to work on different types of data problems on a single unified platform.
Today, there are millions of Databricks Runtime clusters running. Our customers use Databricks Runtime to solve different kinds of data problems in various domains. As you can tell, our data platform is a complex system, and there are lots of different functionalities that our customers can use to power their mission-critical workloads. The quality of our platform is crucial to the success of our customers. In order to ensure the quality of our system and keep improving it, testing is key. Inside Databricks, we are testing our system at a decent scale. The graph here shows the average total test execution time per day for each week since October 2020. We now run 2.8 years’ worth of test execution time each day. There are 54 million tests running per day, or 630 tests per second. This volume has more than doubled in just a little bit more than six months.
There have been several challenges on our path to running tests at this scale. First, we need to execute tests at scale. Taking Databricks Runtime as an example, we have more than 150 CI jobs and thousands of CI job runs per day to test all versions of Databricks Runtime. While running a high number of jobs frequently, we often hit scalability issues with Jenkins. To be able to execute tests at scale, we have been building an in-house CI system and gradually migrating CI jobs to it. As we have the new CI system that helps us execute tests at scale, we also need to be able to dispatch CI jobs at scale. Also, we are using a [inaudible], and the load that frequent commits put on CI testing is high. To address this issue, we built a GitHub webhook receiver and consumer to dispatch CI jobs at scale.
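The dispatching service itself is not described in detail in the talk. As a rough illustration only, a webhook receiver that validates GitHub’s payload signature and hands events to a queue for a separate consumer might look like the sketch below; the endpoint path, secret handling, and in-memory queue are assumptions, not the actual in-house system.

```python
import hashlib
import hmac
import json
import queue

from flask import Flask, abort, request

app = Flask(__name__)
WEBHOOK_SECRET = b"<shared-secret>"   # assumption: secret configured on the GitHub webhook
event_queue = queue.Queue()           # stand-in for a durable queue read by the consumer

def signature_is_valid(payload: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the shared secret."""
    expected = "sha256=" + hmac.new(WEBHOOK_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

@app.route("/github-webhook", methods=["POST"])
def receive():
    body = request.get_data()
    if not signature_is_valid(body, request.headers.get("X-Hub-Signature-256", "")):
        abort(403)
    event_queue.put({
        "type": request.headers.get("X-GitHub-Event"),
        "payload": json.loads(body or b"{}"),
    })
    return "", 202  # the consumer decides which CI jobs to dispatch and when
```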
With the growth of our business, we are going to write and execute more and more tests. We need to be able to handle test results at scale. Even if a single test failing is a very rare event, since we run a large number of tests, and there are tests that depend on external systems like cloud providers, there will be test failures every day. Let’s do some simplified estimation. If a test fails once in a million runs, we will have 54 failures per day. If a test fails once in every 100,000 runs, we will have more than 500 test failures per day. As we keep offering new functionalities and services to our customers, the number of tests will keep increasing. We will need to figure out a sustainable approach to keep track of the results of the tests we run.
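As a quick sanity check, a few lines of arithmetic reproduce the figures quoted above:

```python
# Back-of-the-envelope check of the numbers quoted in the talk.
tests_per_day = 54_000_000
print(round(tests_per_day / 86_400))   # ~625 tests per second, quoted as ~630
print(tests_per_day / 1_000_000)       # 54 failures/day if a test fails once in a million runs
print(tests_per_day / 100_000)         # 540 failures/day if a test fails once in 100,000 runs
```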
To tackle this challenge, we need a system that automatically triages test failures to the right owner in a developer-friendly form. To help us design this system, we first established three guiding principles. First, the system should detect and report test failures automatically. If any step requires human intervention, the system will not be able to produce consistent results, as different people may handle the same test with different standards. Also, we will not be able to scale the system with the scale of the problem if we have a human in the loop. Second, the system needs to connect the problem with the right owner. Every test should have an owner, but when a test fails, it is not necessarily the owner who needs to take action. For example, when we run a [inaudible] test, it’s possible it fails because a dependent service doesn’t work. If we always triage the test failure to the test owner, we may frequently direct noise to the test owner. When the signal-to-noise ratio is low, the system and the reports created by it will quickly lose the trust of engineers. Then those reports will become useless because engineers will not treat them seriously or will simply ignore them. In order to have a system that works in practice, there is a need to understand the failure and triage different types of failures to the right owner.
Once we can connect the problems with the right owner, we also need to make sure the owner understands what is happening and how to start the investigation when they see the reports. In Databricks, we use JIRA to track issues. So we need to ensure reported Jira tickets are well-organized and the information surfaced through JIRA tickets does not overwhelm our engineers, so our engineers can know the number of unique issues at a high level and also know how to get more information to debug problems. Finally, it’s possible the failure attribution needs updating. We also want to empower our engineers to make corrections in a self-service way.
In the rest of this talk, I will show you how we built this test failure reporting system. I will first explain how we approach the problem. Then I will give you a system overview to help you understand how this system is implemented.
First, let’s see what is the problem we need to solve. For our problem, we need to collect data from three sources. We need to get test results from Jenkins and our in-house CI system. Also, we need to establish the mapping from test to test owner. Since we use Bazel as the build tool inside Databricks, we will need to figure out how we get the owner information for each test target defined in Bazel. Then with test results and the mapping from test to test owner, we need to generate test failure reports and send them to the issue tracking system. We need to generate test failure reports that fit the abstraction of JIRA tickets, with owner attribution surfaced through setting JIRA components. As you can see here, we’re essentially looking at a data problem that requires three kinds of data pipelines to solve.
The first data pipeline is used to collect test results from CI systems. The second data pipeline is used to establish the test-to-owner mapping. For both pipelines, we need to persist the data in a form that we can later use to generate failure reports. The last data pipeline uses the persisted data from the previous two pipelines and directs failures into JIRA as the destination. Now that it’s clear what we need to do, let’s see how we actually implemented it.
To implement those three pipelines, we need to first pick the right tools. There are three key decisions to make. First, we need to determine where to host the data pipelines. As the problem is how to design and implement data pipelines, there’s no better place than Databricks itself. Databricks allows us to easily manage our data pipelines and use rich functionality to analyze data. Second, we need to figure out how to access CI systems’ results and Bazel build metadata in Databricks. Because we are using Databricks to host our data pipelines, naturally we’re using Databricks Runtime to process our data. So we’ll use Spark’s data source APIs here. And finally, we need to decide how we store CI systems’ results and Bazel build metadata in Databricks. As we’re using Databricks, this decision is also easy to make. We use Delta to store the datasets. Delta makes continuous data ingestion simple, which I will talk about more very soon. So now we have got our tool set. We’re ready to look into our data pipelines.
First, let’s look at how we build the data pipelines to establish the test results table. For our in-house CI system, we simply use Spark’s JDBC connector to load data directly from our CI system’s database. For Jenkins, we built a Spark Jenkins connector to allow us to query Jenkins build and test data using SQL and the DataFrame API. This data source exposes Jenkins data through three views. The jobs view allows you to query the available Jenkins jobs. The builds view allows you to access build-level metadata, like build status, build start time, and build parameters. The tests view allows you to query detailed test results of selected builds, exposed by the Jenkins JUnit plugin. Through this view, you will be able to see things like error messages and stack traces.
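To make the two ingestion paths concrete, here is a minimal sketch, assuming the `spark` session of a Databricks notebook. The JDBC connection details and the short name and options of the Jenkins data source are illustrative assumptions; the actual in-house connector is not shown in the talk.

```python
# 1. In-house CI system: read its database directly with Spark's JDBC source.
ci_results = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://ci-db.example.com:5432/ci")  # hypothetical endpoint
    .option("dbtable", "test_results")                             # hypothetical table
    .option("user", "reader")
    .option("password", "<secret>")
    .load()
)

# 2. Jenkins: a custom Spark data source exposing jobs / builds / tests views.
jenkins_tests = (
    spark.read.format("jenkins")           # hypothetical short name for the in-house connector
    .option("url", "https://jenkins.example.com")
    .option("view", "tests")               # jobs | builds | tests
    .option("job", "runtime-integration")  # hypothetical CI job name
    .load()
)

failed = jenkins_tests.where("status = 'FAILED'")  # column names are assumptions
```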
Now that we can load our CI systems’ data, we can keep building the data pipeline, with Delta as the destination table format. Using Delta here makes building the continuous data ingestion pipeline easy. When accessing Jenkins, we cannot simply ask Jenkins to return builds finished after a given timestamp. So we need to query Jenkins to give us the most recent builds. Then we use MERGE INTO to only insert new records into the test results table. Also, when we build the data pipelines, the test results are partitioned by different jobs. So we take advantage of the transactional support of Delta to have parallel data pipelines running to ingest data from different Jenkins jobs into a single destination table. And finally, in case any bad write happens, we can use Delta Time Travel to roll back to a recent version and backfill the data.
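A minimal sketch of that idempotent upsert with the Delta Lake Python API, again assuming a Databricks notebook; the table and column names are illustrative.

```python
from delta.tables import DeltaTable

# Recently fetched test results (for example, the output of the Jenkins query above),
# staged as a DataFrame before being merged into the destination table.
new_results = spark.table("ci.staging_recent_results")  # hypothetical staging table

target = DeltaTable.forName(spark, "ci.test_results")   # hypothetical destination table
(
    target.alias("t")
    .merge(
        new_results.alias("s"),
        "t.job = s.job AND t.build_id = s.build_id "
        "AND t.suite = s.suite AND t.test_name = s.test_name",
    )
    .whenNotMatchedInsertAll()  # only insert records we have not ingested before
    .execute()
)

# If a bad write happens, Delta time travel lets us roll back, e.g. in SQL:
#   RESTORE TABLE ci.test_results TO VERSION AS OF 123  -- version number is illustrative
```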
Now, after covering the pipelines that ingest data from CI systems, let’s check out how we collect the test-to-owner mapping. In Databricks, we use Bazel, which is a powerful and highly customizable build tool. Bazel allows us to see the metadata associated with any build target. For example, here is an example output of the metadata associated with a build target, MyTest, inside a package. Through this XML output, we can see that the build target is a ScalaTest target, because this target’s rule class is the generic Scala test rule. This target will execute a suite, com.databricks.MyTest. Also, the most relevant field for what we’re talking about right now is the owners field. It says the owner is spark-env, which is a team name, in this case my team’s name. As Bazel can return this structured information for every build target, we will be able to regularly query Bazel and populate a table that maps tests to their owners.
This diagram shows the current process of populating the test owner table. Once the data pipeline starts, we first check out the needed repositories. Then we use Bazel queries to get the XML output of build targets. With those build target definitions, we parse the XML records and convert them to a Spark DataFrame. Finally, we write the data to the Delta table. If there are any existing records, we use the records from the latest [inaudible] branch to update them. If there is no existing record, we simply insert the new records. For the test owner table, every record shows the test suite name and its corresponding Jira component of the owner. Since we use Jira to track issues, the team name we saw in the previous slide is simply a pointer for us to find the actual Jira component. With these test-to-owner records, we will be able to tell who owns a test. In case we need other fields from Bazel build definitions, we can easily adjust this pipeline to include those new fields.
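A sketch of what this pipeline could look like, assuming `bazel` is available on the driver and the `spark` session of a Databricks notebook; the query expression, XML attribute names, and table name are assumptions rather than the actual pipeline.

```python
import subprocess
import xml.etree.ElementTree as ET

from delta.tables import DeltaTable
from pyspark.sql import Row

# Ask Bazel for the XML definitions of all test targets in the repository.
xml_output = subprocess.run(
    ["bazel", "query", "kind(test, //...)", "--output=xml"],
    capture_output=True, check=True, text=True,
).stdout

# Pull the suite name and owning team out of each rule definition.
rows = []
for rule in ET.fromstring(xml_output).iter("rule"):
    attrs = {child.get("name"): child.get("value") for child in rule}
    if attrs.get("suite") and attrs.get("owners"):   # assumed attribute names
        rows.append(Row(suite=attrs["suite"], owner=attrs["owners"]))

owners_df = spark.createDataFrame(rows)

# Upsert into the Delta table: update existing suites, insert new ones.
(
    DeltaTable.forName(spark, "ci.test_owners").alias("t")  # hypothetical table name
    .merge(owners_df.alias("s"), "t.suite = s.suite")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```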
Finally, let’s look at how we consume the test results data and the test-to-owner mappings to report test failures to Jira. From test results to Jira failure reports, there are three steps. We first built a failure detector, whose job is to query the test results table and load the test failures that we want to monitor. Then this failure detector hands the interesting test failures to the test failure analyzer, whose job is to analyze failures to determine the type of each failure. The type of failure will later influence how we report a failure to Jira. I will talk more about why we need this analyzer very soon. Finally, analyzed failures get passed to a failure reporter. The failure reporter turns the analyzed failures into our internal abstraction of a Jira ticket based on the failure type, with the guidance of the test owners table. Then the failure reporter sends the reports to JIRA.
Here, you will probably have a question about how it knows what failures have already been reported. For this data pipeline, we have two mechanisms to help us avoid sending duplicates. First, when we report a test failure to Jira, we query Jira to see whether the same test of the same build has been reported. However, it’s not efficient to frequently query Jira. Also, for analytics purposes, we need a mechanism to allow us to query what got reported to where. So we added a test failure reports log table. After we send all reports of a failed build to Jira, we log which tickets each test suite of this build got reported to. With this table, the failure detector is also able to filter out suites that have been reported. In the rare case when the pipeline fails before it writes out the logs, the first mechanism will still check Jira and allow us to avoid duplicate reports. So now we have talked a lot about how we built out the automated data pipelines based on our first guiding principle.
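A sketch of how the detector could use that log table to skip already-reported failures, assuming the `spark` session of a Databricks notebook; table and column names are illustrative.

```python
failures = spark.table("ci.test_results").where("status = 'FAILED'")
reported = spark.table("ci.test_failure_reports_log")

# Keep only failures whose (job, build, suite) has not been logged as reported yet.
unreported = failures.join(
    reported,
    on=["job", "build_id", "suite"],
    how="left_anti",
)
```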
Let’s look at how we report a failure to Jira following our second and third guiding principles. The second guiding principle is connecting the problem with the right owner. Although we have the test-to-owner mapping, we cannot fully rely on it to determine who should receive a failure report. Actually, it’s quite common that the test owner is not the owner of certain problems. In order to direct the test failure to the right owner, we need to understand the types of failures and how we should determine the owner of the failure for each type. The first type of failures are problems caused by testing environment issues. For example, a [inaudible] test that needs to launch a cluster fails to launch the cluster. For this type, sending the failure to the test owner doesn’t help, because test owners are often not the owners of the testing environment. Instead, we need to find the owner of the problem.
The second type of failures are those caused by other test failures. You will see this type of failure when tests are not run in isolated environments. For example, one integration test somehow changed a required dataset used by another integration test. For this type of failures, because they are not a root cause, they are just noise, and we don’t need to look at them. The last type of failures are failures that do not belong to type one or two. For those, the owner of the test should own the failure. Our failure analyzer determines the type of a test failure based on patterns in error messages and stack traces. Then we send failures of different types to the failure reporter. For type one failures, we also attach the owner of the problem, as we will not use the owner of the test.
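A minimal sketch of such a pattern-based analyzer; the patterns and owner names below are illustrative assumptions, not the actual rules used inside Databricks.

```python
import re

ENV_PATTERNS = {
    # pattern in the error message or stack trace -> owner of the environment problem
    r"QuotaExceeded|failed to launch cluster": "cloud-infra",
    r"Connection refused .*artifact": "build-infra",
}
CASCADE_PATTERNS = [r"dataset .* not found", r"table .* already exists"]

def analyze(error_text: str):
    """Return (failure_type, owner_override) for a single test failure."""
    for pattern, env_owner in ENV_PATTERNS.items():
        if re.search(pattern, error_text):
            return "type1_environment", env_owner   # report to the environment owner
    for pattern in CASCADE_PATTERNS:
        if re.search(pattern, error_text):
            return "type2_cascading", None          # noise, not a root cause
    return "type3_test_owned", None                 # report to the test owner
```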
Now, we have all the relevant information available. We need to organize the information in a developer-friendly way so developers can consume it efficiently. In order to determine how we should organize different types of failures, or different types of information, we will need to look at how people will use those reports. In Databricks, we primarily have two use cases. The first use case is that we need to understand the unique problems associated with a given team for a given time window. When engineers are on call, they need to get this high-level view to know what is going on and where to start. Also, when we regularly review how teams are operating, we want to see stats related to test failures for teams for a given time window. The second use case is that when we start to debug a specific problem, we want to know more details of how to best investigate the problem. So common questions are how the test failed, what testing environment is having problems, and when the test failed.
So basically, we need to have a detailed view to be able to see all the relevant information to help us quickly figure out the root cause. To serve these two use cases, we need to present test reports in JIRA at two levels. So we use a two-layer reporting structure. We first create parent Jira tickets to represent a unique problem. These parent tickets group all failed instances of the same problem together. Then we create subtasks to represent the individual failures happening in specific testing environments. With this two-layer structure, a new failed instance will get reported as a comment on the right open parent ticket and subtask.
Let’s see two examples. For an example type 1 issue, where the cloud provider cannot launch virtual machines because we exceeded the virtual machine quota, we group all failed instances of this type together, no matter what tests are failing and what versions are having this failure. For an example type 3 issue, we group failed instances of the same test [inaudible]. Here it is FooBarSuite from different testing environments and Databricks Runtime versions grouped together. Here, I also want to point out that the two-layer reporting structure essentially allows us to group failures at two levels, and we can adjust how we want to group them as appropriate. For example, for type 1 failures, if we find that parent tickets should only group failures in a single testing environment together, we can easily make that change.
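As an illustration of the two-layer structure, the sketch below creates a parent ticket and a subtask through the Jira REST API; the Jira URL, credentials, project key, component, and summaries are all hypothetical.

```python
import requests

JIRA = "https://jira.example.com"        # hypothetical Jira instance
AUTH = ("reporter-bot", "<api-token>")   # hypothetical service credentials

def create_issue(fields):
    resp = requests.post(f"{JIRA}/rest/api/2/issue", json={"fields": fields}, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["key"]

# Parent ticket: one per unique problem, attributed to a team via a Jira component.
parent_key = create_issue({
    "project": {"key": "TEST"},
    "summary": "FooBarSuite is failing",
    "issuetype": {"name": "Bug"},
    "components": [{"name": "spark-env"}],
})

# Subtask: one per testing environment where the problem shows up; new failed
# instances would then be appended as comments on the right parent and subtask.
subtask_key = create_issue({
    "project": {"key": "TEST"},
    "parent": {"key": parent_key},
    "summary": "FooBarSuite failing in environment X",  # environment label is illustrative
    "issuetype": {"name": "Sub-task"},
})
```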
Of course, developer-friendly failure reporting is not just about how we group failures. We also need to take advantage of features provided by the issue tracking system. As we use JIRA to track issues and track release blockers, we have automation to automatically escalate failures that match certain criteria to blockers. We also automatically set the affects version on test failure reports. So critical test failures can find their way to our release blocker dashboard, and teams will see which test failures should get handled first. Finally, developer-friendly failure reporting is also about making it easy for developers to fix any owner attribution issue or any problem related to determining the type of failure. In Databricks, developers can fix these issues easily [inaudible]. Once the change is reviewed and merged, it will get automatically deployed.
Okay. That’s all I’d like to talk about today. In summary, we talked about building automated data pipelines to manage test results at scale. Through my introduction of our data pipelines, I hope you see how Databricks and Delta make the work easy to implement. We looked at how we connect test problems to the right owner and why it’s important. And finally, we discussed how we surface information from CI systems to Jira to make test failure reports easier to consume.
We still have more to do in this area. In particular, we want to start to build a holistic view of all CI/CD activities. With the data we have from our CI/CD systems, we will be able to gain insights to continuously drive improvements in engineering productivity, product quality, and operational efficiency. If you find our work in this area interesting, and you want to get hands-on experience with it, I hope you can join us. Thank you everyone for your time. Have a great day and enjoy this year’s summit.

Yin Huai

Yin is a Staff Software Engineer at Databricks. His work focuses on designing and building Databricks Runtime container environment, and its associated testing and release infrastructures. Before join...