Applying DevOps to Databricks can be a daunting task. In this talk it will be broken down into bite-size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IAC (Infrastructure as Code) and Build Agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, we will focus on the Databricks REST API (using Python) to perform our tasks.
Anna Wykes: Hi, I’m Anna. Welcome to my talk on DevOps for Databricks. I am a Data Engineering Consultant for Advancing Analytics. Our agenda, what we’re going to do: we’re going to look at what DevOps is; CI/CD, Continuous Integration/Continuous Deployment; IAC, so Infrastructure as Code; Build Agents; the Databricks REST API; a real-world example; and then some other tooling examples.
So what is DevOps? We’ve got our BI developer, our data scientist, our data engineer, our software engineer, all data professionals. Our BI developer wants to get the dashboard published on a website. Our software engineer wants to update the website with the latest dashboard. Our data scientist wants to productionize models and have them automatically update. Our data engineer wants to push the latest ETL pipelines to production. Now essentially, this all boils down to DevOps. They all want to get their product, their solution, into the real world.
So DevOps is essentially that process around the code, the solution: we build it, we test it, we release it, and we deploy it. It’s that figure of eight we can see here: creating the solution, testing it, publishing it, and going around that figure of eight over and over again, working together to deploy our solutions. To do that, we have DevOps pipelines: the development stage, then testing, then production. What tools do we use to actually achieve this? We’ve got two categories of tools: Continuous Integration/Continuous Deployment, and Infrastructure as Code.
For Continuous Integration/Continuous Deployment, we can use Azure DevOps, CircleCI, Jenkins, Octopus Deploy, GitLab. For Infrastructure as Code, you’ve got ARM Templates in Azure, Terraform, Pulumi, and Azure Bicep. There are many, many more options, but these are some of the most commonly used tools. Continuous Integration and Continuous Deployment, CI/CD: continuous improvements, essentially feature releases, fast bug fixes, and the ability to quickly roll back. That’s what CI/CD gives us. It gives us the ability to test our code within those pipelines, so unit testing, integration testing, end-to-end testing. We can also perform linting within those pipelines as well, making sure that all code is in a nice format and conforms to our standards.
Infrastructure as Code, IAC. Infrastructure as Code is essentially the blueprint of your solution. It’s writing down how you want your solution to be structured: what services you want to use, what databases, what data lake storage solution you want to use, for example. That’s all defined in your Infrastructure as Code.
Build Agents. What is a Build Agent? It is the compute under your DevOps pipelines. We’ve already talked about pipelines; a build agent is that compute under the hood. There are out-of-the-box agents available in DevOps tooling such as Azure DevOps, but you can also create your own custom agent. You can do that with a VM agent, or you could create a custom Docker agent to perform the same tasks.
So in your pipeline, you can define some YAML, and that’s what we’re going to be looking at in a bit. Your agent essentially lives under the hood. It will pick up the YAML when you trigger it and then process all the steps in your YAML; your pipeline essentially gets executed on that agent. So why would you use a custom build agent? You can decide specifically what you want your code to run on: whether you want a Linux machine or a Windows machine, what versions of the operating systems, what Docker image. You can make sure all the tooling that you need is already installed on those machines, so you don’t have to put that into your pipeline. You don’t have to install specific packages, which would be adding time to your pipeline; you can make sure all the tools you need are already on that agent. It can keep your state. It can also live in a virtual network, and that’s super important with companies, with clients, who want to keep everything secure.
The Databricks REST API. We’re going to be looking at examples of the Databricks REST API, but what exactly is that? The Databricks REST API allows you to perform lots of essential Databricks tasks. If you use REST already, then you can use your existing REST knowledge out of the box and start working with it straight away, so it’s easy to pick up. Because it’s REST, you can also use your language of choice against it. We’re going to be using Python, because within the world of Databricks and Spark there are lots and lots of Python users, so it makes sense that we write our code in Python. Essentially, what we’re going to be doing using the REST API is cross-platform. I will be demoing it in Azure, but that doesn’t mean you have to use Azure; you can use the same principles with any cloud provider.
Our real-world example. What are we going to do? We’re going to use Python scripts and the Databricks REST API to create a Databricks cluster, check cluster status, upload notebooks to the Databricks workspace, run some tests against our Python code, build and upload a Python wheel to Databricks, and install/uninstall/update Python wheels in Databricks. Then we’re going to use Azure DevOps to run our scripts. So we’ll have a YAML pipeline, as mentioned before, and then we’ll quickly look at the custom DevOps agent that I’ve set up to actually run these pipelines.
Okay. So here we’ve got a Databricks instance. This is the Databricks instance that I’ve created in Azure. We can see here, I’m looking at the cluster. Click on this tab here. I’ve got a basic notebook that I created. If you look down here at what clusters are available, we’ve got one that I’ve called manual-cluster, because I’ve come in manually and created it. But we’re going to create our own using the REST API, using our scripts. If we go into our cluster here, you can see there are currently no libraries. Here, we’ve moved into Visual Studio Code. In here I have my own solution, where I’ve got all of my scripts, including a very basic Python wheel, which incorporates just one task that we’re going to look at as well. Then we’ve also got our pipeline and our notebook.
Firstly, we’ll look at the pipeline scripts. What do we want to do first? We want to actually create our cluster. What is this script doing? In this script, and in all of the scripts, the first thing we are doing is authenticating. You need to authenticate with the REST API, and to do this you set up some tokens that you then work with to actually be able to use the REST API. Then I’ve got this method here, create cluster, and it’s doing exactly what it says on the tin. It is communicating with that REST API and creating us a cluster, and that’s what this JSON is here: we’re just defining what that cluster is going to be. So let’s actually [inaudible].
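The authentication and cluster-creation steps described here can be sketched against the standard Databricks Clusters REST API. This is a minimal illustration, not the exact demo script; the workspace URL, token, and cluster sizing values are placeholder assumptions you would substitute with your own.

```python
import os
import requests  # common third-party HTTP client for REST work

# Placeholder values -- substitute your own workspace URL and personal access token.
DATABRICKS_HOST = os.environ.get("DATABRICKS_HOST", "https://example.azuredatabricks.net")
DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN", "dapi-example-token")

def auth_headers(token: str) -> dict:
    """Every REST call authenticates by sending the token as a bearer header."""
    return {"Authorization": f"Bearer {token}"}

def cluster_spec(name: str) -> dict:
    """The JSON definition of the cluster we want -- sizes here are illustrative."""
    return {
        "cluster_name": name,
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
        "autotermination_minutes": 30,
    }

def create_cluster(name: str) -> str:
    """POST to the Clusters API and return the new cluster's ID."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/create",
        headers=auth_headers(DATABRICKS_TOKEN),
        json=cluster_spec(name),
    )
    resp.raise_for_status()
    return resp.json()["cluster_id"]
```

Calling `create_cluster("devops-cluster")` against a real workspace would return the new cluster ID, which the polling step then uses.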
Okay, so it’s printed the cluster ID, which is so and so. But now, we’ve got this “Cluster is Pending Start.” Why is it doing that? I have deliberately created this method here: as soon as we run the create cluster method, we’ve got this manage Databricks cluster method. What this is doing is polling the REST API until it gets a status against that cluster that it can actually work with: either a “cluster started” status, or, if we’re unlucky, a “cluster has failed to start.”
So in here we’ve literally got a while loop that is just waiting for a relevant status it can work with. If we were to leave this code as it is, this terminal would essentially keep printing that the cluster is pending until it gets a “cluster has successfully started” or a “cluster has failed to start.” For the sake of the demo, we’re going to cancel out of that. That’s not going to hurt the cluster in any way, because we were literally just polling to see the status of it. But if we go into our notebook… Yeah, let me just do a quick refresh.
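The polling loop described above might look something like this sketch. The `classify` helper and function names are mine, not from the demo; the state names come from the Databricks Clusters API.

```python
import time
import requests

def classify(state: str) -> str:
    """Map a raw cluster state from the API onto the three outcomes the loop cares about."""
    if state == "RUNNING":
        return "started"
    if state in ("TERMINATED", "ERROR", "UNKNOWN"):
        return "failed"
    return "pending"  # e.g. PENDING, RESTARTING, RESIZING

def wait_for_cluster(host: str, token: str, cluster_id: str, poll_seconds: int = 30) -> bool:
    """Poll /clusters/get until the cluster has started or failed to start."""
    while True:
        resp = requests.get(
            f"{host}/api/2.0/clusters/get",
            headers={"Authorization": f"Bearer {token}"},
            params={"cluster_id": cluster_id},
        )
        resp.raise_for_status()
        outcome = classify(resp.json()["state"])
        if outcome == "started":
            print("Cluster has successfully started")
            return True
        if outcome == "failed":
            print("Cluster has failed to start")
            return False
        print("Cluster is pending start")
        time.sleep(poll_seconds)
```

Cancelling out of the loop, as in the demo, is harmless because every iteration is a read-only GET.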
Now, before we just had the manual cluster, but now we’ve got this DevOps cluster as well. We can see that it is starting: that circle is going around and around. We now have a DevOps cluster, created via scripts, that is starting up. So what do we want to do now? Well, we’ve got a cluster, so now we want to have a go at actually uploading some notebooks, with the script to upload notebooks to dbx. Again, it does what it says on the tin: we’re authenticating, and then we literally have some Python code here which is working with the REST API to look inside the folder here, grab the notebook in there, and upload it. When we run that script, we get a response of 200, and that’s what we want.
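The notebook-upload script works against the Workspace API's import endpoint, which expects the notebook source base64-encoded. A hedged sketch, with illustrative paths and helper names of my own:

```python
import base64
import requests

def notebook_import_payload(workspace_path: str, source: str) -> dict:
    """Build the JSON body for POST /api/2.0/workspace/import.
    The notebook source must be base64-encoded."""
    return {
        "path": workspace_path,  # e.g. "/DevOpsForDatabricks/demo-notebook" (illustrative)
        "format": "SOURCE",
        "language": "PYTHON",
        "overwrite": True,
        "content": base64.b64encode(source.encode("utf-8")).decode("ascii"),
    }

def upload_notebook(host: str, token: str, workspace_path: str, source: str) -> int:
    """Upload one notebook; a 200 status code means the import succeeded."""
    resp = requests.post(
        f"{host}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {token}"},
        json=notebook_import_payload(workspace_path, source),
    )
    resp.raise_for_status()
    return resp.status_code
```

The 200 response checked in the demo corresponds to `resp.status_code` here.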
Going to our Workspace, up in here we can see a DevOps notebook that has been uploaded via our script. What do we want to do now? We’ve looked at notebooks; there are a lot more scripts in here that wrap around that, which we’ll see in the pipeline in a minute. But what we want to look at now is the wheel. We’ve got a really basic Python wheel. Now, a Python wheel is a Python package essentially. It’s a library that you can use the same as you would any other library in Python; it’s just one that you’ve created yourself. So we’ve got a wheel here, really basic, and it’s got some tests in here, one test specifically. This test we want to run in our pipeline, but firstly, let’s see it running locally.
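The actual contents of the demo wheel aren't shown, but a wheel with a single task and a single pytest test might look like this sketch (the module layout and names are entirely hypothetical):

```python
# demo_wheel/tasks.py -- hypothetical single task packaged inside the wheel
def add_numbers(a: int, b: int) -> int:
    """The one piece of demo logic the wheel packages up."""
    return a + b

# tests/test_tasks.py -- the single test that pytest discovers and runs
def test_add_numbers():
    assert add_numbers(2, 3) == 5
```

Running `pytest` from the project root would discover `test_add_numbers` and report one passed test, as in the demo.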
It’s using pytest, which is a testing package in Python. Okay, we can see that we’ve got one passed test, so this test is working. It’s just testing some really simple demo code that I’ve got in here. So we’ve got a test there and we can run that in our pipeline. But firstly, what we want to do is actually get that wheel onto the cluster. We’re going to upload our wheel to Databricks, with this script to upload the wheel to DBFS. We’ve also got a check wheel status script here, which does exactly what the polling did when it was checking the status of the cluster: it will just poll and wait to see that the wheel has installed correctly. So it does exactly the same thing as before.
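Uploading the wheel and installing it on the cluster map onto the DBFS and Libraries REST APIs. This is a sketch under assumptions (the DBFS path and helper names are mine): the wheel bytes go to `/api/2.0/dbfs/put` base64-encoded, and the install call points the cluster at the resulting `dbfs:` path.

```python
import base64
import requests

def dbfs_put_payload(dbfs_path: str, wheel_bytes: bytes) -> dict:
    """Body for POST /api/2.0/dbfs/put -- file contents go up base64-encoded.
    (Larger files need the streaming create/add-block/close calls instead.)"""
    return {
        "path": dbfs_path,  # e.g. "/libs/demo_wheel-0.1-py3-none-any.whl" (illustrative)
        "overwrite": True,
        "contents": base64.b64encode(wheel_bytes).decode("ascii"),
    }

def install_wheel_payload(cluster_id: str, dbfs_path: str) -> dict:
    """Body for POST /api/2.0/libraries/install -- point the cluster at the wheel on DBFS."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"whl": f"dbfs:{dbfs_path}"}],
    }

def upload_and_install(host: str, token: str, cluster_id: str,
                       dbfs_path: str, wheel_bytes: bytes) -> None:
    """Upload the wheel to DBFS, then ask the cluster to install it."""
    headers = {"Authorization": f"Bearer {token}"}
    requests.post(f"{host}/api/2.0/dbfs/put", headers=headers,
                  json=dbfs_put_payload(dbfs_path, wheel_bytes)).raise_for_status()
    requests.post(f"{host}/api/2.0/libraries/install", headers=headers,
                  json=install_wheel_payload(cluster_id, dbfs_path)).raise_for_status()
```

After the install call the library shows as "installing" in the cluster UI, which is the status the check-wheel-status script polls on.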
There we go, into our cluster [inaudible] a minute. We have to refresh the page. That’s it, we’re uploading it. Now we need to do an install. There we go. Right, there we are. There’s our wheel. We’ve uploaded it via our script and it has the status of installing, which is exactly what we want. So we’ve seen the testing for the wheel, we’ve seen uploading the wheel itself, we’ve seen how we can create a cluster and poll against [inaudible] status, and we’ve uploaded our notebooks. Now we want to actually see this in the context of DevOps, because we were just doing it all locally. We want to see it working in Azure DevOps itself.
Okay. So here we are now in Azure DevOps. We can see here, I’ve already created a pipeline, a DevOps for Databricks pipeline. If we just dig into that and go into “Edit”, we can see our YAML. This YAML is within the repository we were just looking at; it was in the folder there as well. It was in our source control, and we just hook it up within Azure DevOps to a pipeline. To do that is super simple. You just go into “Pipelines”, “New pipeline”. You select Azure Repos Git, you say it’s an existing YAML file you created, and you can hook it up that way. Or, if you wanted to, you can get Azure DevOps to do it for you at this stage: you can say, just create me a new one in this Git repository, and it will.
So we go into our pipeline and into the editor again. Here, we’ve initially got a load of stuff to set up our pipeline. We’re saying what branch we’re interested in and what pool we’re working with. That’s our custom agent, and I’ll show you in a minute the custom agent that I’ve actually set up. If you weren’t making a custom agent, you’d specify just a normal out-of-the-box Azure agent, and it would simply go away and find you one every time this runs. Variables: I’ve got a variable group sitting around in the background that’s got a lot of the variables I use throughout the pipeline. Then we actually go into our jobs. This is where we are defining the tasks that are going to run our scripts.
Initially, we’ve got some setup stages: setting up Azure Key Vault, installing some Python packages, doing all our authentication. Now this bit I’ve commented out. This is literally where we are creating our cluster and then doing that polling, and because that takes quite a long time, I’ve commented it out for the demo. Normally we’d have this step in as well, and it would do that polling we saw before. Then we’ve got our second job here, which is upload notebooks. If we scroll down to this task here, we are specifying the script path. So for that script we were looking at before, the upload notebooks to dbx script, we are saying: this is where my script is, run it as this task. It will essentially go away and find that script. We pipe in all of the environment variables that we need, so those tokens I was talking about for the REST API: we set those up earlier in this pipeline and then we pipe them in so we can use them in the script.
We tell it what directory we’re in so it can actually find the relevant files, and we’re away. Essentially, we’ve got a whole load of these tasks in here that are going away and setting all this up, running our scripts in sequential order. So initially here we’re uploading our notebook, then we’re going to upload our wheel, install the Python packages again, and run our tests here. I’ll show you in a second what the output of this is. So we’re doing our pytest. That will go away and find the tests, and then we’ve got some extra parameters here which make sure that these tests are visible to Azure DevOps. Then it can actually stop the pipeline accordingly if the tests fail, and we can see those tests in a nice [inaudible].
So we’ve published our test results, and then here we’re actually calling our scripts. So, upload wheel to DBFS: we are uploading it into DBFS in Databricks, somewhere that Databricks can see it. Then we uninstall the wheel, and this is because we need to uninstall it before we re-install it. So we do an uninstall, and then we do the install. When you uninstall a wheel, you need to restart the cluster, so there’s a restart cluster script that we call. Then we essentially install the wheel; that’s the install wheel script we ran before. Then we’ve got our check wheel status. Like I said, that’s just going to poll and wait for that wheel to have a status of either success or failure.
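The uninstall/restart/install sequence in the pipeline can be summarised as an ordered list of REST calls. The endpoint paths are from the Databricks Libraries and Clusters APIs; the helper function itself is just illustrative, to make the ordering explicit.

```python
def wheel_refresh_steps(cluster_id: str, wheel_path: str) -> list:
    """The ordered REST calls for replacing an installed wheel: an installed
    library cannot simply be overwritten, so uninstall it, restart the cluster
    (uninstalls only take effect on restart), install the new build, then
    poll its status until it reports success or failure."""
    return [
        ("POST", "/api/2.0/libraries/uninstall",
         {"cluster_id": cluster_id, "libraries": [{"whl": wheel_path}]}),
        ("POST", "/api/2.0/clusters/restart",
         {"cluster_id": cluster_id}),
        ("POST", "/api/2.0/libraries/install",
         {"cluster_id": cluster_id, "libraries": [{"whl": wheel_path}]}),
        ("GET", "/api/2.0/libraries/cluster-status",
         {"cluster_id": cluster_id}),
    ]
```

Each pipeline task in the demo corresponds to one of these steps, executed sequentially.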
So where are those test results? If we just step out of this here, we go into our job. You can see here we’ve got a menu: summary, which is basically all about the steps in our Azure pipeline. We’ve also got this Tests tab. This is here because we were actually outputting those test results. What we can see here, we’ve got a total of one test, and it’s telling us that the test passed. If it had failed, it would have stopped the pipeline for us and this tab would show it. That’s super useful if you’ve got a suite of unit tests in there: you can essentially make your code really robust and make sure that nothing gets pushed into your test or development environments that doesn’t pass your unit tests, your blueprint there.
So that’s our YAML and our tests. What about this agent that I’m talking about, this custom agent? We go into our project settings and, down here in the pipelines section, to our Agent pools. I’ve set up this DevOps for Databricks pool. An agent pool is essentially a pool of agents that you set up, which could be VMs or maybe Docker containers that you’ve got running in your infrastructure. In here, if we go to agents, you can see I’ve set up a DevOps for Databricks agent. This is simply a VM within Azure, literally a VM that I’ve set up in Azure. I have installed the relevant components on it so that it can become an agent, and connected it to this particular Azure DevOps instance. If we tab to the jobs here, we see that pipeline we’ve got and all the times it’s run. So it’s logging all of that against our agent, all the processes that have run against it.
Okay. So now we’re going to look at examples of other DevOps tools on Azure. What else can we use other than Azure DevOps, other than working with the REST API from Python for Databricks? Other IAC tools: we can use Terraform, we can use Azure ARM Templates, we can use Pulumi, we could use Azure Bicep, and many, many more. What is Terraform? Terraform is a super popular tool, but what is it? Terraform is an open-source Infrastructure as Code software tool that provides a consistent CLI workflow to manage hundreds of cloud services.
Terraform codifies cloud APIs into declarative configuration files. Terraform has a concept of write, plan, and apply. Write, essentially, is where you write your Infrastructure as Code, your blueprint. Then you can plan it: you literally run plan and see what’s actually going to happen if you deploy those changes or create those resources. That’s great as a safety net: if you’re about to blow something away that you actually want to keep in your infrastructure, or change something that shouldn’t be changed, you’ll see it first. Then you hit apply and it pushes all of that up into your infrastructure, into your cloud solution.
So this is a diagram taken from Terraform’s own documentation on how to do Infrastructure as Code for Databricks. You can look at that documentation and find this diagram, and it literally illustrates all the different components that you can work with and how that hooks into AWS and Azure. This is just some example Terraform, again taken from their documentation, where we’re setting up our Databricks connection, specifying that we are going to be uploading some notebooks, and specifying a bit about what we’d like for the job. That’s on the left-hand side; the right-hand side is what’s actually carrying out those tasks.
What is Pulumi? Cloud engineering for everyone: build, deploy, and manage modern cloud applications and infrastructure using familiar languages, tools, and engineering practices. Essentially, in Pulumi you can write your Infrastructure as Code in Python, in TypeScript, in Go, or in C#. So if any of those are languages that you’re really familiar with, it makes Infrastructure as Code a lot more accessible, because you’re writing it in something you already know. Just a quick example of the Pulumi Azure Databricks module: it’s based on the AzureRM Terraform Provider. We’ve got some example Pulumi here, in the Python flavour for us Python developers. It’s an example of where we’re setting up a workspace with Pulumi. The great thing with Pulumi is that, because you’re writing in Python, if there’s any feature or function missing that you want for your Infrastructure as Code, you can literally just, for example, start working with the REST API as we have done in this demonstration. You can write that in, jumping in and out of it to do your Infrastructure as Code.
What is Bicep? Project Bicep is the next generation of ARM Templates. Now, if you’ve worked with ARM Templates, you know for sure they can get really complex and quite confusing to understand. Bicep is the next generation of that way of defining your Infrastructure as Code, more easily. Bicep is a cleaner, easier language that gets compiled into ARM when it’s deployed. So it still creates ARM under the hood, but you’ve got an easier Bicep language over the top. You write your Bicep, that compiles into your ARM Templates, those go to Azure Resource Manager, and then that’s deployed to your solution.
Finally, our summary. DevOps is for everyone. CI/CD keeps your code in check and gets the latest features and changes into production as soon as possible. IAC, Infrastructure as Code, is the blueprint of your solution, with lots of tooling options. And the Databricks REST API can be used in conjunction with Python and Azure DevOps to create effective, fault-tolerant pipelines.
Anna is a veteran software & data engineer with over 15 years of experience. She’s tackled projects from real-time analytics with Scala & Kafka to building out Data Lakes with Spark and applying engi...