by Ka-Hing Cheung and Vaibhav Sethi
Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.
Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps and UIs to simply review and commit code. Not only is this time-consuming, but it's also error-prone.
Repos solves this problem by providing repository-level integration with all popular Git providers directly within Databricks, enabling data practitioners to easily create new or clone existing Git repositories, perform Git operations and follow development best practices.
With Databricks Repos, you get access to familiar Git functionality, including the ability to manage branches, pull remote changes and visually inspect outstanding changes before committing them, so that you can easily follow Git-based development workflows. Furthermore, Repos supports a wide range of Git providers, including GitHub, Bitbucket, GitLab and Microsoft Azure DevOps, and provides a set of APIs for integration with CI/CD systems.
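As a rough illustration of the CI/CD integration mentioned above, here is a minimal sketch of how a pipeline step might ask the Repos API to move a Repo to a given branch. It assumes the `PATCH /api/2.0/repos/{repo_id}` endpoint with a `branch` field and bearer-token authentication; the host, token and repo ID are placeholders, and the helper name `build_checkout_request` is hypothetical. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_checkout_request(host: str, token: str, repo_id: int, branch: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that moves a Repo to `branch`."""
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only; substitute your own workspace details.
request = build_checkout_request(
    host="https://example.cloud.databricks.com",
    token="dapi-EXAMPLE",
    repo_id=123,
    branch="main",
)
print(request.get_method(), request.full_url)
```

A CI job would typically send the request (for example with `urllib.request.urlopen(request)`) after tests pass on the target branch, so that the Repo used by production jobs tracks the latest reviewed code.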
We are also excited to announce new functionality in Repos that allows you to work with non-notebook files, such as Python source files, library files, config files, environment specification files and small data files in Databricks. This feature, called Files in Repos, helps with easy code reuse and automation of environment management and deployments. Users can import (or clone), read, and edit these files within a Databricks Repo just like in any local filesystem. It is now available in a public preview.
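To make "just like in any local filesystem" concrete, the sketch below reads a config file and a small data file with ordinary Python file APIs. A temporary directory stands in for the Repo root so the example runs anywhere; the file names `config.json` and `sample.csv` and their contents are made up for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Stand-in for the Repo root; in a workspace this would be the cloned repository folder.
repo_root = Path(tempfile.mkdtemp())

# Hypothetical files that would normally be committed to the repository.
(repo_root / "config.json").write_text(json.dumps({"table": "events", "max_rows": 2}))
(repo_root / "sample.csv").write_text("id,value\n1,10\n2,20\n3,30\n")

# Read the config file with the standard json module.
with open(repo_root / "config.json") as f:
    config = json.load(f)

# Read the small data file and keep only as many rows as the config allows.
with open(repo_root / "sample.csv", newline="") as f:
    rows = list(csv.DictReader(f))[: config["max_rows"]]

print(config["table"], rows)
```

Because the config and data live next to the code, they are versioned, reviewed and deployed together with it.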
Files in Repos provides you with a simplified and standards-compliant development experience. Let's take a look at how this helps with some common development workflows:
Python and R modules can be placed in a Repo, and notebooks in that Repo can reference their functions with the 'import' statement. You no longer have to create a new notebook for each Python function you reference, or package your module (as a whl for Python) and install it as a cluster library. Files in Repos helps you replace all of these steps (and more) with a single line of code.
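The "single line of code" is an ordinary import. The following sketch simulates the workflow outside Databricks: a temporary directory plays the role of the Repo, and `cleaning_utils.py` with its `normalize_name` function is a hypothetical module. The sketch puts the folder on `sys.path` itself; whether a notebook needs that step depends on how the workspace configures the import path, which is an assumption here.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the Repo root; in a workspace this would be the cloned repository folder.
repo_root = Path(tempfile.mkdtemp())

# A hypothetical helper module committed alongside the notebooks.
(repo_root / "cleaning_utils.py").write_text(
    "def normalize_name(name):\n"
    "    # Trim surrounding whitespace and title-case the name.\n"
    "    return name.strip().title()\n"
)

# Make the folder importable (assumption: done explicitly here for portability).
sys.path.insert(0, str(repo_root))
importlib.invalidate_caches()

# The single line a notebook would need:
from cleaning_utils import normalize_name

result = normalize_name("  ada lovelace ")
print(result)
```

Compared with wrapping each function in its own notebook or building and installing a wheel, the module stays a plain source file that is versioned with the notebooks that use it.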
In summary, with Databricks, data teams no longer need to build ad hoc processes to version-control and productionize their code. Databricks Repos enables data teams to automate Git operations, allowing tighter integration with the company's established CI/CD pipelines. The new Files feature in Repos enables importing libraries for code portability, versioning environment specification files and working with small data files.
Repos is now generally available. To get started, click on the 'Repos' button in your sidebar or use the Repos API.
The Files in Repos feature is in Public Preview and can be enabled for Databricks Workspaces. To enable it, go to Admin Panel -> Advanced and click the "Enable" button next to "Files in Repos." Learn more in our developer documentation.
To discover how Databricks simplifies development for data teams by enabling automation at each step of the ML lifecycle, check out this on-demand webinar with Databricks architect Rafi Kuralisnik.