Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks' needs. First developed in 2019, Runbot has been incrementally replacing our aging Jenkins infrastructure with something more performant, scalable, and user-friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that went into it, and how we used it to greatly improve the experience of all the developers within the Databricks engineering organization.
Traditionally, Databricks has used Jenkins as the primary component of our CI pipeline that validates pull requests (PRs) before merging into master. Jenkins is a widely used and battle-tested piece of software, and has served Databricks well for many years. We had over time built up some ancillary services and infrastructure surrounding it:
However, over time Jenkins' weaknesses were becoming more apparent. Despite our best efforts to stabilize it and improve the user experience, we faced continual complaints about our Jenkins infrastructure in our internal surveys:
“Test infrastructure is frequently breaking.”
“Jenkins error message is not well presented.”
“The jenkins test results are very hard to explore”
“I would guess that weekly or biweekly I have to adjust my workflows or priorities to work around an issue with build infrastructure.”
Jenkins was becoming a constant thorn in the side of our development teams. Newbies kept tripping over the same known idiosyncrasies that had gone unfixed for years. Veteran developers continued to waste time on inefficient workflows. And the team tending to the CI system itself had to spend its time responding to outages, fielding questions arising from the poor user experience, and managing chronic issues like flaky tests with inadequate tooling.
Despite many attempts over the years to improve it, this situation had remained largely unchanged. Weeks to months of work trying to, for example, better integrate our testing infrastructure with the Github Checks UI, or better manage resource usage on the Jenkins master, had proven unsuccessful. The whole experience remained clunky and unstable, and it was becoming clear that more effort expended in the same way would always have the same outcome. Working around Jenkins' limitations with auxiliary services and integrations was no longer enough; we needed to try something different if we wanted to provide a significant improvement for those around the company using Databricks CI.
The problems with our Jenkins infrastructure could largely be boiled down to two main issues:
I will discuss each in turn.
Over the years, we had implemented many common CI features as a collection of separate services specialized to Databricks’ specific needs:
While Jenkins itself provides many of these features already, we often found we had our own needs or specific requirements that Jenkins and its plugins did not satisfy. We would have loved to not have to build and maintain all this stuff, but our hand was forced by the needs of the team and organization.
This sprawling complexity made incremental improvement difficult, even for seemingly simple things such as:
In general, if you wanted to make a small change to the Databricks' CI experience, there just wasn't a place to "put things". Jenkins' own source code was external and non-trivial to modify for our needs, and none of the various services we had spun up were a good platform for implementing general-purpose CI-related features. Data, UIs, and business logic were scattered across the cluster of microservices and a web of tangled integrations between them.
Now, these issues weren't insurmountable. For example, to get a multi-job history dashboard, we ended up opening multiple browser windows on our office-wall dashboard, painstakingly positioning them side by side, and pointing each one at a different Jenkins job:
Left for hours or days, Jenkins' "Blue Ocean" UI would sometimes mysteriously stop updating, so we installed a Chrome plugin to auto-refresh the browsers at an interval. Even then, the "Blue Ocean" web UI would sometimes somehow bring down the whole OSX operating system (!) if left open for too long with too many tabs (n > 4), forcing us to power-cycle the Mac Mini running the dashboard! It worked, but it couldn't be said to have worked well.
Improving this experience with our current service architecture was very difficult.
Apart from our own menagerie of microservices, Jenkins itself is a sprawl of functionality: multiple HTML UIs, multiple configuration subsystems, endless plugins, all with their own bugs and all interacting in unintuitive ways. While this is to be expected from a project grown over a decade and a half, it definitely levies a tax on anyone trying to understand how it all fits together.
Consider the multi-browser Jenkins dashboard shown above. Let's say we wanted to add a tooltip showing how long each job run was queued before being picked up by a worker, or to make the COMMIT column link back to the commit page on Github. Trivial to describe, but terrible to implement: do we fork Jenkins to patch its Blue Ocean UI? Write a Chrome plugin that we ask everyone to install? Write a separate web service that pulls Jenkins' data through its API to render as HTML? All of these were bad options for someone who just wanted to add a tooltip or a link!
Over the years we had made multiple attempts to improve the CI developer experience, with only marginal success. Essentially, attempts to add user-facing features to our CI system got bogged down in ETL data plumbing, microservice infrastructure deployment, and other details that dwarfed the actual business logic of the feature we wanted to implement.
Apart from making forward progress difficult, the existing architecture was also problematic for us to manage operationally. In particular, Jenkins was an operational headache, causing frequent outages and instability due to some of its fundamental properties:
This architecture has the following consequences:
These consequences caused us pain on a regular basis:
In experiments, we found Jenkins could manage about 100-200 worker instances before the stability of the master started deteriorating, independent of what those workers actually did. The failure modes were varied: thread explosions, heap explosions, ConnectionClosedExceptions, etc. Attempts to isolate the issue via monitoring, profiling, heap dumps, and the like had largely been unsuccessful.
As engineering grew, we found the Jenkins master falling over once every few days, always causing ongoing test runs to fail, and sometimes requiring a significant amount of manual effort to recover. These outages even occurred at times when Jenkins load was minimal (e.g., on weekends). Databricks' bespoke integrations also sometimes caused issues; for example, the test explorer ETL job caused outages. As engineering continued to grow, we expected system stability to become ever more problematic as the load on the CI system increased further.
In finding a replacement for our Jenkins infrastructure, we had the following high-level goals:
In any plan, what you hope to do and accomplish is only half the story. Below are some of the explicit non-goals -- things we decided early on that we did not want to accomplish:
Any CI system we ended up picking would need to do the following things:
Building your own bespoke CI system is a big project, and not one you want to do if you have any other alternatives. Before even considering this effort, we investigated a number of alternatives:
One common thread throughout the investigation of alternatives was the lack of "bare metal" EC2 support or bring-your-own-infra in most "modern" CI systems. They all assumed you were running inside containers, often their specific containers inside their specific infrastructure. As an organization which already had a good amount of cloud infrastructure we were perfectly happy with, running inside someone else's container inside someone else's cloud was a non-starter.
There were other alternatives that we did not investigate deeply: Concourse CI, CircleCI, Gitlab CI/CD, Buildkite, and many others. This is not due to any value judgement, but simply the reality of having to draw a line at some point. After digging into a half-dozen alternatives, some deeply, we felt we had a good sense for what was out there and what we wanted. Anyway, all we wanted was something to run bash commands on EC2 instances and show us logs in the browser; how hard could that be?
So that brings us to Runbot, the CI system we developed in-house. At its core, Runbot is a traditional "three tier" web application, with a SQL database, stateless application server(s), and a website people can look at:
Apart from the backend system that manages CI workers to validate PRs and master commits and report statuses to Github, a large portion of Runbot's value is in its web UI, which lets a user quickly figure out what's going on in the system, or what's going on with a particular job they care about.
Written in Scala, Runbot started out at roughly 7,000 lines of code all included: database access, web/API servers, cloud interactions, worker processes, HTML web UI, etc. It has since grown to about 10,000 lines with the addition of new features and use cases.
The technical design of Runbot can be summarized as "not Jenkins". All the issues we had with Jenkins, we strove hard to avoid having with Runbot.
| Jenkins | Runbot |
| --- | --- |
| XML file system datastore | PostgreSQL database |
| Stateful server | Stateless application servers |
| Requires constant connection to workers | Tolerates intermittent connection to workers |
| Extensible via plugins | Extensible by just changing its code |
| Groovy config language | Jsonnet config language |
| Groovy workflow language | No workflow language; it just runs your executable and you do your own thing |
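To make the last two rows concrete: with no workflow language, a job boils down to a name, a worker budget, and a few shell commands. Below is a hypothetical sketch of that shape in Scala -- the real configs are written in Jsonnet, the field names are borrowed from the worker-lifecycle description later in this post, and all the values shown are made up:

```scala
// Hypothetical Scala model of a Runbot job definition, for illustration only.
object JobConfigExample {
  case class JobConfig(
    name: String,
    maxWorkers: Int,        // cap on concurrent EC2 worker instances
    initCommands: String,   // run once when a fresh worker boots (e.g. git clone)
    runCommands: String,    // the job itself: just a shell command, no workflow DSL
    cleanupCommands: String // ready the working directory for the next job run
  )

  // "No workflow language": the job is whatever executable you tell it to run.
  val compileMaster = JobConfig(
    name            = "Compile-Master",
    maxWorkers      = 8,
    initCommands    = "git clone https://example.com/repo.git .",
    runCommands     = "git checkout $SHA && bazel build //...",
    cleanupCommands = "bazel shutdown; git clean -xdf"
  )
}
```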
Runbot is a much simpler system than Jenkins: as mentioned earlier, it started out at around 7,000 lines of Scala (smaller than java.util.regex!) and has grown only modestly since then.
At its core, Runbot is just a traditional three-tier website. HTTP requests come in (GETs from user browsers or JSON POSTs from workers and API clients), we open a database transaction, do some queries, commit the transaction and return a response. Not unlike what someone may learn in an introductory web programming class.
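For the curious, a minimal sketch of that request cycle might look something like the following. This is purely illustrative -- the route and table names are hypothetical, and Runbot's real code uses its own internal libraries -- but it shows the shape: plain JDBC against Postgres behind the JDK's built-in HTTP server (a Postgres JDBC driver on the classpath is assumed):

```scala
import java.net.InetSocketAddress
import java.sql.DriverManager
import com.sun.net.httpserver.{HttpExchange, HttpServer}

object RunbotServerSketch {
  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/jobs", (exchange: HttpExchange) => {
      // One transaction per request: open, query, commit, render.
      val conn = DriverManager.getConnection("jdbc:postgresql://localhost/runbot")
      conn.setAutoCommit(false)
      try {
        val rs = conn.createStatement()
          .executeQuery("SELECT name, status FROM job_runs ORDER BY id DESC LIMIT 50")
        val rows = Iterator.continually(rs).takeWhile(_.next())
          .map(r => s"<tr><td>${r.getString("name")}</td><td>${r.getString("status")}</td></tr>")
          .mkString
        conn.commit()
        val body = s"<table>$rows</table>".getBytes("UTF-8")
        exchange.sendResponseHeaders(200, body.length)
        exchange.getResponseBody.write(body)
      } finally {
        conn.close()     // Postgres rolls back anything left uncommitted
        exchange.close()
      }
    })
    server.start()
  }
}
```

One transaction per request keeps each page render a consistent snapshot of the database, which is a large part of why the system is so easy to reason about.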
On top of the common system architecture described above, Runbot does do a few notable things in order to support its featureset:
Despite being a distributed cluster manager with real-time pub/sub and a live-updating website, at its core Runbot works similarly to any other website you may have seen. We reuse the same database-backed HTTP/JSON web servers as a platform to overlay all other necessary domain-specific systems. This simplicity of implementation has definitely been a boon to maintenance and ease of operating and extending the system over time.
One part of Runbot's design that deserves special discussion is the worker lifecycle. While Runbot manages several elastic clusters of cloud EC2 instances, the actual logic involved is relatively naive. The lifecycle of a job-run and worker can be summarized as follows:
1. A job run is queued on the Runbot server, with parameters identifying what to run (e.g., the sha for a post-merge master job, or the pr ID for a pre-merge PR job)
2. If no worker is free to pick it up, the Runbot server spins up a fresh EC2 worker instance (up to the job's configured maxWorkers limit)
3. The new worker runs its pre-configured initCommands to initialize itself (e.g., cloning the git repo for the first time), and once done it subscribes for an unclaimed job run from the Runbot server.
4. On receiving a job run, the worker runs the pre-configured runCommands to actually perform the job logic: checking out the relevant SHA/PR-branch using the parameters given with the queued job run, then bazel build-ing, bazel test-ing, or sbt test-ing things as appropriate.
5. When runCommands is complete, the worker runs some pre-configured cleanupCommands to try and get the working directory ready to pick up another job run (bazel shutdown, git clean -xdf, etc.), and re-subscribes with the Runbot server and either receives or waits for more work.
6. If a worker waits longer than timeouts.waiting (typically configured to be 10 minutes) without receiving a job run, it calls sudo shutdown -P now and terminates itself. We also have a scheduled background job on the Runbot server that regularly cleans up workers where shutdown fails (which happens!)

The above description is intentionally simplified, and leaves out some details in the interest of brevity. Nonetheless, it gives a good overview of how Runbot manages its workers.
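To make the lifecycle concrete, here is a hypothetical sketch of a worker's main loop in Scala. The claimJobRun and reportResult stubs stand in for Runbot's real HTTP/JSON worker API (whose details this post doesn't cover), and all the commands shown are illustrative:

```scala
import scala.sys.process._
import java.time.{Duration, Instant}

object WorkerLoop {
  // Hypothetical stand-ins for values that come from the job's config:
  val initCommands    = "git clone https://example.com/repo.git ."
  val cleanupCommands = "bazel shutdown; git clean -xdf"
  val waitingTimeout  = Duration.ofMinutes(10) // timeouts.waiting

  // Stand-ins for Runbot's real worker API, not shown in this post:
  def claimJobRun(): Option[String] = None                 // poll for an unclaimed job run
  def runCommandsFor(run: String): String =                // the job's runCommands
    s"git checkout $run && bazel test //..."
  def reportResult(run: String, exitCode: Int): Unit = ()

  def main(args: Array[String]): Unit = {
    Seq("bash", "-c", initCommands).! // one-time init, e.g. the first git clone
    var idleSince = Instant.now()
    while (Duration.between(idleSince, Instant.now()).compareTo(waitingTimeout) < 0) {
      claimJobRun() match {
        case Some(run) =>
          val exit = Seq("bash", "-c", runCommandsFor(run)).! // do the actual job logic
          Seq("bash", "-c", cleanupCommands).!                // ready workdir for the next run
          reportResult(run, exit)
          idleSince = Instant.now()
        case None =>
          Thread.sleep(5000) // nothing unclaimed yet; re-poll shortly
      }
    }
    // Waited longer than timeouts.waiting without work: power off this instance.
    Seq("bash", "-c", "sudo shutdown -P now").!
  }
}
```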
Some things worth noting about the above workflow include:
- How much worker capacity each job gets is governed by maxWorkers, the frequency of running the spawn-instance background job, and timeouts.waiting. While not precise, this does give us plenty of knobs we can turn to make long-lived single-worker jobs or short-lived multi-worker jobs. timeouts.waiting in particular can be tweaked to trade off between idle workers and job-run queue times: longer worker idle times mean there's more likely to be a waiting worker to pick up your job run immediately upon you queuing it.

Runbot's worker management system can be thought of as similar to "objects" in object-oriented programming: you "allocate" a worker by calling the EC2 API, invoke its "constructor" by running initCommands, and then call "methods" on it with parameters by running runCommands. Like "objects", Runbot workers use internal mutability and data locality for performance, but try to preserve some invariants between runs.
It's almost a bit surprising how well Runbot's worker management works. "ask for instances when there's work, have them shut themselves down when there isn't" isn't going to win any awards for novelty. Nevertheless, it has struck a good balance of efficiency, utilization, simplicity, and understandability. While we have definitely encountered problems as usage has grown over the past two years, the naivete of its worker-management system isn't one of them.
Scalability and stability were not the only motivations for Runbot. The other half of the picture was to provide a vastly better user experience, based on the learnings from the Jenkins-related microservices that had organically grown over time.
Part of the goal of Runbot is information density: every inch of screen real estate should be something that a user of the system may want to see. Consider again the home page dashboard shown earlier: it's a relatively straightforward page, with each job having a number of workers (small squares on the left) and a number of job runs (a histogram of durations on the right), with both workers and job runs colored to show their status (yellow/grey = initializing, green = idle/success, blue = in progress, red/black = failed). At a glance we're able to draw many conclusions about the nature of the jobs we are looking at:
Even when things are going well, we can already notice some things. Why is compilation on Compile-MacOS-Master flaky? Why was Compile-Master broken for a while, and could that have been avoided?
But this ability is even more important when things are going wrong. Maybe a job started becoming super flaky, maybe a worker instance got corrupted and is in a bad state, maybe AWS is out of instances in that region and new EC2 workers cannot start. All of those things can often be seen at a glance, and every element on the page has tooltips and links to further information:
It's worth contrasting Runbot's main dashboard with the equivalent Jenkins UI:
While Jenkins theoretically has all the same data if you drill in, it generally requires a lot more drilling to find the information you need. Something obvious at a glance on Runbot may take a dozen clicks and multiple side-by-side browser windows to notice on Jenkins. (Above I show the traditional UI, but the newer "Blue Ocean" UI isn't much better.) For example, while at a glance Jenkins shows only the most basic summary information, Runbot is able to show you the whole history of the job and its variation over time, giving a visceral feel for how a job is behaving.
Using Runbot, someone investigating an issue is thus able to quickly drill down to information about the job, about the job run, about the worker, all to aid them in figuring out what's going wrong. Front-end UI design is often overlooked when building these internal backend systems, but Runbot's UI is a core part of its value proposition and why people at Databricks like to use it.
By and large, Runbot's UI is a simple server-side HTML/CSS UI. There is hardly any interactivity -- any configuration is done through updating and deploying the Jsonnet config files -- and what little interactivity there is (starting, cancelling, and re-starting a job run) is implemented via simple HTML forms/buttons and POST endpoints. We have some custom logic to allow live-updating dashboards and streaming logs without a page refresh, but that's purely a progressive enhancement and the UI is perfectly usable without it.
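As an illustration of what "simple server-side HTML" means in practice, here is a sketch of rendering a dashboard row. We use the Scalatags library purely as an assumption -- the post doesn't name Runbot's actual templating approach -- but the idea is the same: ordinary functions that return HTML, with no client-side Javascript required:

```scala
// A sketch of server-side-rendered dashboard rows; library choice and
// route paths are hypothetical, not necessarily what Runbot actually uses.
import scalatags.Text.all._

object DashboardSketch {
  // Color a job-run cell by status, matching the color scheme described earlier.
  def statusColor(status: String): String = status match {
    case "success"     => "green"
    case "in-progress" => "blue"
    case "failed"      => "red"
    case _             => "grey"
  }

  def jobRow(jobName: String, runs: Seq[(String, String)]) = tr(
    td(a(href := s"/jobs/$jobName")(jobName)),
    runs.map { case (runId, status) =>
      td(backgroundColor := statusColor(status))(a(href := s"/runs/$runId")(status))
    }
  )

  def dashboard(jobs: Map[String, Seq[(String, String)]]): String =
    html(body(table(jobs.toSeq.map { case (name, runs) => jobRow(name, runs) }))).render
}
```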
One interesting consequence of this static HTML UI is how performant it is. The Github Actions team had published a blog post about how they used clever virtualization techniques to render large log files of up to 50k lines. Runbot, with its static server-side-rendered logs UI (including server-side syntax highlighting), is able to render 50k lines without issue in a browser like Chrome, without any client-side Javascript at all!
Perhaps the larger benefit of static HTML is how simple it is for anyone to contribute. Virtually any programmer knows some HTML and CSS, and would be able to bang out some text and links and colored-rectangles if necessary. The same can't be said for modern Javascript frameworks, which while powerful, are deep and complex and require front-end expertise to set up and use correctly.
While many projects these days use a front-end framework of some sort, Runbot's mostly-static HTML has benefits of its own. We do not have any front-end specialists maintaining Runbot, yet any software engineer can put together a quick page querying the data they want and rendering it in a presentable format.
Since its introduction in 2019, we have been gradually offloading jobs from Jenkins and moving them to Runbot. Usage of Runbot has grown considerably, and it now processes around ~15,000 job runs a day.
This gradual migration has allowed us to reap benefits immediately. By moving the most high-load/high-importance jobs from Jenkins to Runbot, not only do those jobs get the improved UX and stability that Runbot provides, but those jobs left behind on Jenkins also benefit from the reduced load and improved stability. Right now we are at about 1/3 on Jenkins and 2/3 on Runbot, and while the old Jenkins master still has occasional issues, the reduced load has definitely given it a new lease on life. We expect to gradually move more jobs from Jenkins to Runbot as time progresses.
The growth of usage on Runbot has not been without issue. From the original plan of 50 concurrent workers, we're now approaching 500 concurrent workers with 15,000 CPU cores during peak hours:
While we performed load testing before the initial rollout to make sure things worked at the expected scale, we did end up hitting some speed bumps as the usage of our system grew 10x over the past two years. Given that the Runbot application servers are stateless and easily replaceable, most of the problems ended up being with the Postgres RDS database, which is a single point of failure.
Although these issues caused outages, it is comforting that they are common problems shared by basically everyone operating a database-backed web service. Common problems have common solutions, which sure beats trying to untangle the interactions between Jenkins threadpool explosions, its multi-terabyte XML disk store, and our menagerie of custom microservices surrounding it.
Part of Runbot's sales pitch was how much it simplified development on our CI system. This has largely been borne out over the past two years it's been running.
We have had several individuals contribute useful features to the Runbot codebase, from smaller UI improvements like improved test-name truncation or more informative tooltips to bigger features like binary artifact storage or job run history search. We've had experienced folks join the company and immediately tweak the UI to smooth over some rough edges. We've had interns contribute significant features as a small part of their main internship project.
None of the individuals contributing these improvements were experts in the Runbot system, but because Runbot runs as a simple database-backed web application, they were able to find their way around without issue and modify Runbot to their liking. This ease of making improvements would have been unheard of with the old Jenkins infrastructure.
This has validated our original hope that we'd be able to make Runbot simple enough that anyone would be able to contribute just by writing some Scala/SQL/HTML/CSS. Runbot may not have the huge ecosystem of plugins that Jenkins has, but its ability to be easily patched to do whatever we want more than makes up for it!
So far we have talked about what we have done with Runbot in the past; what about the future? There are a few major efforts going into Runbot now and planned for the future.
Runbot's usage has grown around 10x in the past two years, and we expect it to continue to grow rapidly in future. Databricks' engineering team continues to grow and, as we mentioned, will continue to port jobs off our old Jenkins master. Not just that, but as the company matures, people's expectations continue to grow: a level of reliability that was acceptable in 2019 Databricks is no longer acceptable in 2021.
This means we will have to continue streamlining our processes, hardening our system, and scaling out the Runbot system to handle ever more capacity. From the current scale of ~500 concurrent workers, we could expect that to double or triple over the next year, and Runbot needs to be able to handle that. Scaling database-backed web services is a well-trodden path, and we just need to execute on this to continue supporting Databricks' engineering as it grows.
One of Runbot's selling points was that as a fully-bespoke system, we can make it do whatever we want by changing its (relatively small) codebase. As the CI workflows within Databricks evolve, with new integration testing workflows and pre/post-merge workflows and flaky test management, we will need to adapt Runbot with new UI and new code paths to support these workflows.
One interesting example of this is using Runbot as a general-purpose job runner in restricted environments. While we're familiar with both Jenkins and Runbot, the fact that Runbot is such a small system means that it has a minimal attack surface: no complex permissions system, no user account management, no UI for re-configuring jobs at runtime, and a tiny codebase that's easily audited. That makes it much easier to be confident that you can deploy it securely, with minimal room for vulnerabilities.
Another use case may be building out an experience for managing the deployment validation pipeline. Currently per-commit Master tests, N-hourly integration tests, and others are handled in an ad-hoc manner when deciding whether or not a commit is safe to promote to staging and beyond. This is a place where Runbot's extensibility really shines, as we could build out any workflow imaginable on Runbot's simple web-application platform.
From its conception, Runbot was built as a product managed and operated by the team that developed it. As time goes on, other teams are adopting it more and more, and the system is making the transition from a bespoke piece of infrastructure to a self-service commodity, like Github Actions or Travis CI.
Over the past two years we have already polished away many of the rough edges from that earlier era, but there's a lot of work left to turn Runbot into an "appliance" that someone can operate without ever needing to peek under the hood to understand what's going on. The goal is to make Runbot as simple to use as Travis CI or Github Actions, turning it from "internal infrastructure" into a "product" or "appliance" anyone can just pick up and start using without help.
In this blog post, we have discussed the motivation, design, and rollout of Databricks' Runbot CI system. As a bespoke CI system tailored to Databricks' specific requirements, Runbot has several advantages over our old Jenkins infrastructure: stability, simplicity, improved UX, and easier development and evolution over time.
Runbot is an interesting challenge: a mix of cloud infrastructure, distributed systems, full-stack web, front-end UI, and developer experience, all in one project. Despite the common adage about re-inventing the wheel, over the past two years Runbot has proven itself a worthy investment, cutting through a decade and a half of organic growth to provide a streamlined CI system and experience that does exactly what we need and nothing more. Runbot has proved crucial to supporting Databricks as it has grown over the past two years, and we expect it will play a pivotal role in supporting Databricks as it continues to grow in future.
If you're interested in joining Databricks engineering, whether as a user or developer of our world class internal systems, we're hiring!