Model deployment and integration often consist of several moving parts that require intricate steps woven together. Automating this pipeline and feedback loop can be incredibly challenging, especially in light of varying model development techniques. MLflow and the Model Registry can act as powerful tools to simplify building a robust CI/CD pattern for any given model. In this talk we will explore how MLflow, specifically the Model Registry, can be integrated with continuous integration, continuous delivery, and continuous deployment tools. We’ll walk through an end-to-end example of designing a CI/CD process for a model deployment and implementing it with MLflow and automation tools.
– Good morning, good afternoon, or good evening, everybody. My name is Mary Grace Moesta and I’m here with Pete Tamsin. We’re gonna walk through our talk today, which is all about productionalizing models through both CI/CD design and MLflow.
So before we get into any of the content, we’ll just do quick introductions. As I said, my name is Mary Grace Moesta. I’m a Customer Success Engineer here at Databricks, and I’m mainly supporting customers in the retail and CPG space. In a previous life I was a data scientist, so a lot of my work was focused on customer experience and brand acceleration within the retail and CPG space. Additionally, I am a contributor to the Databricks Labs product called AutoML. I’m based in the city of Detroit, and I like to run and golf when I can peel myself away from my computer.
– Hi everyone, my name’s Peter Tamsin. I’m a Technical Lead on the Customer Success Team. I am based out of Atlanta, Georgia, where I live with my wife and four kids. I have been in the data space for about 20 years. At Databricks, I work a lot with our Databricks Automation SME group, where I’ve contributed to multiple blogs and published a couple of best practice guides specifically on DevOps and CI/CD.
– So the agenda today: first we’ll talk through some definitions and assumptions. We’ll define what MLOps is, using a definition that I think is pretty accurate for what we’ve seen in the field. We’ll talk through the importance of MLOps in a production system. We’ll go through the basics of CI/CD, so we can level set with these base definitions of, hey, here’s the CI/CD process. Then we’ll pivot those CI/CD basics specifically to machine learning and talk about how they translate into an end-to-end machine learning project. After we go through these definitions and assumptions, we’re actually gonna walk through an example of promoting machine learning code, and the model itself as an artifact, through a CI/CD pipeline. We’ll talk through the specifics of version control, how to interface with MLflow and the MLflow client, and how to register the model using the MLflow Model Registry. And then we’ll go through the specifics of building an Azure DevOps pipeline that’s gonna trigger production runs of both training and inference for this given model.
So starting with this definition of continuous delivery for machine learning. This is from a paper, and it basically defines Continuous Delivery for Machine Learning as a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles. I know this is a little bit of a mouthful, but like in any good machine learning project, you wanna set your definitions and assumptions early. So I’m defining Continuous Delivery for Machine Learning here, which kind of sets the stage for the further content that we’ll walk through. The emphasis that I wanna draw attention to is this concept of having small and safe increments, adapting your models in short adaptation cycles. So it’s all about being able to iterate quickly and safely, especially in these cloud environments and your end-to-end machine learning pipelines.
So why is MLOps relevant? If we start from the beginning, the data science and machine learning development framework is traditionally centered around local development. That means work is scoped to a data scientist’s local laptop, code is saved and versioned locally, and folks are limited to the amount of compute and memory within their local machine. That breaks down pretty quickly, especially if you’re starting to do these projects on massive sets of data that aren’t gonna fit within your local machine. And that’s where the cloud comes in handy, right? As the complexity of data and processes grows, so does the number of integration points. If you’re moving from this local development experience to the cloud, and you’re operating in a distributed system when you’re leveraging tools like Spark, that means there are gonna be more machines to manage and your data is stored across various locations. You have all the permissioning and authorization that needs to happen to access all these integration points. And it can keep building and building complexity. So the cost of kicking off a single run now becomes much more complex and expensive. That safety concept that I talked about in the quote on the previous slide is really about this complexity and expense of what scaling up a machine learning process looks like in the cloud. Kicking off a run on your full dataset can mean spinning up a significant number of machines with a significant amount of CPU, which definitely comes at a cost. And so machine learning operations allows for development at scale and hands-off execution of production runs. If we think about the whole purpose of doing machine learning projects in the data science community, it’s being able to serve these to the business. MLOps is important in your end-to-end machine learning process because it allows not only for development to happen at scale, so these quick iterations can happen, but this hands-off execution also allows the business to consume these solutions much more easily and quickly. Which again is getting to that ultimate goal with a much smoother pipeline. So we’ve gone through the definitions of what we view as MLOps and talked about why it’s relevant, and now Pete’s gonna go through the basics of CI/CD.
– Thanks, Mary Grace. So yeah, I’d like to start off by talking about traditional CI/CD. CI/CD is a process that’s been around for decades in traditional software engineering, and it’s becoming increasingly important within the realm of data engineering and data science. The reason is that, in order for the content generated by these pipelines to be valuable, it has to be timely and it has to be accurate. Automating this process assists with that and ensures that pipelines are running efficiently and correctly. So what I’d like to start off talking about is some of the steps that are traditionally involved in a CI/CD process. We’ll start with the CI, which is Continuous Integration. These are all steps that are commonly done manually when you’re developing things; we’re just wrapping automation around them. You develop your code in a notebook or in an IDE. You’d maybe run some manual unit tests on it. Then once you’re ready to have that merged in with other people’s code, you would commit that code into a source-code repository. At that point, the CI/CD process would initiate a build, and it would take all the code that’s been checked in. If it needs to be compiled, it will compile it, but basically it’s just assembling all the different pieces of code that you need to put into an artifact that will eventually be deployed to your target environment. During the build process, we’re doing things like more automated unit testing and we’re gathering dependencies. Optionally, you might be doing some compiling in this stage. And then finally, once it’s all put together, you’re building a release, which is taking all those separate parts and putting them together into an artifact that can eventually be deployed into a target environment.
And so those are typically the steps that are involved in Continuous Integration. This is often decoupled from Continuous Delivery, because you might want to run this process multiple times a day, whereas the Continuous Delivery part, where you’re actually deploying to a target environment, may be on a different schedule. During Continuous Delivery, you’re taking that artifact that you built during the Continuous Integration phase and deploying it to the appropriate environments. If you have DDL scripts that need to go to a database, you’d be connecting to that database and executing them. There might be code that needs to be pushed up to a web server, or notebooks that need to be deployed to another workspace. These are all the steps that are typically involved in the Continuous Deployment or Continuous Delivery phase. Once they’re actually deployed there, you’ll wanna run further tests. Since everything’s been put together, these are more integration tests that take place once all the component parts have been assembled. And then finally, once it passes all these tests and it’s promoted to the appropriate environment, in the case of a data engineering or data science pipeline you’ll eventually want to schedule these jobs and then monitor them as they run. That feedback would then be sent back to developers, and the process starts all over again.
– Awesome, thanks Pete. So now that we’ve talked about the basics of CI/CD, let’s take those and lay them over what this would look like in an actual machine learning project. Pete talked about the Continuous Integration side. When it comes to code, what this looks like in an ML experiment would be using MLflow to track your experiment runs: tracking your hyperparameters, tracking changes in your code, and any artifacts that come out of these runs.
And again, this would still happen in a notebook or an IDE environment. So you’d be developing on a feature branch within your version-controlled Git repo using your favorite ML tools. That’s the sklearns, Spark MLs, and PyTorches of the world. When it comes to the actual build process, what that looks like in an ML pipeline is that your training runs are gonna happen at scale, with the new model features or new hyperparameters that you’ve implemented in the changes in your code. And then you’re tracking the different model versions in production using Model Registry. So you’re tracking all of those different builds within the Model Registry as that central governance location. And then lastly, on the Continuous Integration side, there’s release. As Pete mentioned, there’s a release artifact in a typical CI/CD pipeline. But a release in an ML project could be a whole slew of things. It could be a model. It could be an entire pipeline model, it could be certain images, it could be code itself. There are a lot of things that are considered artifacts. And that’s gonna be really dependent on the problem that you’re solving and what the downstream business needs to consume from this specific ML project.
So on the Continuous Delivery side, there’s the deploy portion. In machine learning there are a few different methods. There’s traditional batch inference and batch scoring. There’s real-time serving, so serving as a REST endpoint using containers, like Docker or Azure ML containers. Or there’s using cloud inference services, so that’s the SageMakers and Azure MLs of the world. Deployment, again, is gonna be defined by your business problem and what the business needs. Is the SLA such that this job or this machine learning model is run overnight, or is it such that when a customer interfaces with this model it needs a response time of milliseconds? So again, that deploy step is all gonna be dependent on the business question that your model is wrapped around. Test is pretty consistent. In an ideal world, you should be running tests for all your machine learning code and feature engineering, to make sure that as you’re delivering this, you’re not breaking anything else downstream.
And then lastly for operate, this is where, again, you can leverage tools like Jenkins and Azure DevOps, et cetera, to trigger new model builds based on any trigger that you want to define. We’ll walk through an example in the back half of this that actually uses Azure DevOps, and that’s going to trigger based on any changes that are committed to your master branch in GitHub. But the same concept works for Jenkins, and the trigger doesn’t necessarily need to be a change to a repo. You could use webhooks to trigger it, so it depends on changes within the Model Registry or any other triggers that are defined within your machine learning process.
So how does MLflow contribute to this? If you look on the left side, we’ve got this zoo of tools, per se. There are several different frameworks that can be used in the actual code part of Continuous Integration. There are a ton of different packages and modeling formats that can be used depending on your business problem. And so MLflow is actually gonna be that governing body that stretches across all of those CI and CD concepts we discussed, to provide consistent views across your development and production environments. MLflow Models is the way that MLflow is able to package things up and bring models into a reproducible format. So what you’re building in your code on the Continuous Integration side is consistent and is going to be deployable to whatever your deployment needs are. And then, more on the code development side, the MLflow Tracking Server is gonna be the way to develop your models and track all of these different feature branches and all of the changes that are happening within your modeling process. And then the registry comes into play when it comes to tracking these releases and deployments over time. So again, it’s that governance layer that’s going to provide the audit path of, hey, here’s what model was in production and when, here’s who promoted it, here’s when it moved out of production. You have that clear audit trail for your model. And then again, the way that MLflow packages everything up using this MLflow packaging format allows for deployment to the suite of tools that’s on the far right end here. As we talked about on the previous slide, there are a lot of concepts in the CI/CD flow for machine learning that are dependent on your business problem, which is gonna affect the packages and libraries you use, which is going to affect the deployment methods. And MLflow is really that consistent piece that’s able to keep all of that together, to provide consistency across your production and development environments.
So we’re now gonna move through an example of what this actually looks like. In the process that I’ve gone through for all of this, I have read a lot of stuff about MLOps, and walking through an example was really the most helpful thing to understand what this process is like. In this case, like I said, we did this all through Azure DevOps, but you can pick your favorite version control, whether it’s GitHub, AzDO, or Bitbucket. The typical pattern we see is that you have a master repo and you’re gonna branch based on your different features, or changes in your feature set, or hyperparameters, algorithm refinements, et cetera. So if I’m a data scientist and I wanna try an ensemble method on the current model, I would branch from my master repo, name my branch ensemble development, kick off, and start going crazy. And that’s where MLflow Tracking comes in. That’s where those metrics and all of that criteria can be tracked in the MLflow Tracking Server. So as I’m developing this new ensemble refinement to my model, I have a good picture of what I’ve tried, what’s worked, what hasn’t, and what is being deemed successful depending on the measures of variance that I’m using. And then lastly, MLflow is gonna be the place to track any additional artifacts that will be used in the downstream build and release stages. That screenshot on the right-hand side there is just an example of what tracking the artifact looks like.
In this case, it’s a model that has a couple of different flavors to it.
But again, this is the way that we can track those artifacts that we mentioned in the build and release stage for CI. So if we’re thinking about that classic CI/CD pipeline that Pete talked about a little bit earlier, we’re in the code phase right now: hey, we’re developing our code on our new branch and testing it to see what works and what doesn’t before we push to our master branch.
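The slide code for this tracking step is not reproduced in the transcript. As a rough sketch of what logging one run from a feature branch might look like, the helper below records a branch tag, hyperparameters, and metrics through an injected tracker object. The tag name `git_branch`, the run name, the parameter values, and the `InMemoryTracker` stand-in are all assumptions made for illustration, not part of the talk.

```python
from contextlib import contextmanager
from types import SimpleNamespace


def log_training_run(tracker, run_name, params, metrics, branch):
    """Record one experiment run and return its run ID.

    `tracker` is expected to expose start_run, set_tag, log_params and
    log_metrics with the same call shape as the `mlflow` module.
    """
    with tracker.start_run(run_name=run_name) as run:
        tracker.set_tag("git_branch", branch)  # hypothetical tag name
        tracker.log_params(params)             # e.g. hyperparameters
        tracker.log_metrics(metrics)           # e.g. validation error
        return run.info.run_id


class InMemoryTracker:
    """Offline stand-in for the `mlflow` module (illustration only)."""

    def __init__(self):
        self.records = []

    @contextmanager
    def start_run(self, run_name=None):
        self.records.append(("start_run", run_name))
        # Made-up run ID; a real tracking server would generate one.
        yield SimpleNamespace(info=SimpleNamespace(run_id="run-0001"))

    def set_tag(self, key, value):
        self.records.append(("set_tag", key, value))

    def log_params(self, params):
        self.records.append(("log_params", dict(params)))

    def log_metrics(self, metrics):
        self.records.append(("log_metrics", dict(metrics)))


tracker = InMemoryTracker()
run_id = log_training_run(
    tracker,
    run_name="ensemble-trial",
    params={"n_estimators": 200, "max_depth": 6},
    metrics={"rmse": 0.42},
    branch="ensemble-development",
)
print(run_id, len(tracker.records))
```

With MLflow installed, passing the `mlflow` module itself as `tracker` should work, since `mlflow.start_run`, `mlflow.set_tag`, `mlflow.log_params`, and `mlflow.log_metrics` have this call shape. Injecting the tracker keeps the sketch runnable without a tracking server.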
So next is actually controlling the model flow through those build and release stages. Say that I have trained a new model with a new feature set, or a new ensemble model, and I’m ready to promote it to master and promote it to my production environment. The way that this works is that this code interfaces with the MLflow APIs. This first cell up here is parameterizing your Databricks notebook, so this, again, can be a hands-off deployment. It’s parameterizing it, pointing it to the right path: hey, where to pick up my model, where to pick up my artifacts. The next box is setting the decision criteria for the best run. I have all my runs sitting in the tracking server, and I know, hey, I wanna pick the run that has the lowest RMSE. I wanna be able to do that programmatically. If that’s gonna be my success criteria for this model, I wanna go ahead and pick that run up. So this cell is just setting that decision criteria for the best run. And then this last cell here is searching through all of these filtered runs to identify the best run, but also programmatically building the model URI that’s gonna be referenced back later. That’s the unique identifier that we’ll use in automation moving forward.
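The notebook cells described here are not shown in the transcript, so the following is only a minimal sketch of the selection logic. It works on plain dictionaries rather than a live tracking server, and it treats lower RMSE as better because RMSE is an error metric. The run IDs, metric values, and the artifact path `model` are made up for illustration.

```python
def select_best_run(runs, metric="rmse", lower_is_better=True):
    """Return the run whose metric is best under the stated criterion."""
    scored = [run for run in runs if metric in run["metrics"]]
    if not scored:
        raise ValueError(f"no run has logged the metric {metric!r}")

    def metric_value(run):
        return run["metrics"][metric]

    if lower_is_better:
        return min(scored, key=metric_value)
    return max(scored, key=metric_value)


def build_model_uri(run_id, artifact_path="model"):
    """Build an MLflow `runs:/` URI pointing at a logged model artifact."""
    return f"runs:/{run_id}/{artifact_path}"


# Made-up runs standing in for what a tracking server would return.
candidate_runs = [
    {"run_id": "a1", "metrics": {"rmse": 0.52}},
    {"run_id": "b2", "metrics": {"rmse": 0.41}},
    {"run_id": "c3", "metrics": {"r2": 0.80}},  # skipped: no RMSE logged
]

best = select_best_run(candidate_runs)
model_uri = build_model_uri(best["run_id"])
print(model_uri)
```

Against a real tracking server, the same ordering can be pushed down to MLflow with `mlflow.search_runs(..., order_by=["metrics.rmse ASC"])`, which returns the runs as a DataFrame.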
So again, as we continue to control the model flow through the build and release stages, we’ve now developed this new model. We have selected the best run based on our decision criteria. And now we’re gonna use the Model Registry to actually track the flow of the models in and out of production. Note that, and I think this is a common confusion with Model Registry, the stages that are defined in the registry do not directly translate to environments. The stages within Model Registry are None, Staging, Production, and Archived. Those aren’t necessarily mapped to different environments. They’re just the stage tags within the registry itself. And if you have multiple workspaces, the registry currently always spans one workspace. So if you’re promoting to different environments that are isolated at a workspace level, you’ll have to add an extra step here that’s going to recreate the registry in your other environments.
So what this code is doing at a really basic level is initially registering this new model, saying: I’ve selected my best run, and I wanna register it as the one that’s gonna be targeted for production. Then I’m gonna archive the current model that’s in production. So if this new model is gonna be promoted, this is gonna take the model that’s in production now, tag it as archived, and then flip, so the new model is the one in production. So it archives the previous one, and then that last cell down there is actually promoting the new one to the production stage. And again, there is this audit path and this governance path of understanding what models are moved in and out of production and by whom, and you have clear reasoning as to why.
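The registration cells themselves are not in the transcript. A minimal sketch of the archive-then-promote ordering described above could look like this, assuming the stage-based registry API (`MlflowClient.transition_model_version_stage`) that was current when the talk was given. The model name and version numbers are hypothetical, and newer MLflow releases deprecate stages in favor of aliases.

```python
def plan_promotion(versions_by_stage, new_version):
    """Return ordered (version, target_stage) transitions.

    Versions currently in Production are archived first, then the
    new version is promoted.
    """
    plan = [
        (version, "Archived")
        for version in versions_by_stage.get("Production", [])
        if version != new_version
    ]
    plan.append((new_version, "Production"))
    return plan


def apply_plan(client, model_name, plan):
    """Apply the transitions through an MlflowClient-like object."""
    for version, stage in plan:
        client.transition_model_version_stage(
            name=model_name, version=version, stage=stage
        )


# Hypothetical state: version "3" is in Production, "4" was just registered,
# for example via mlflow.register_model(model_uri, "demand-forecast").
plan = plan_promotion({"Production": ["3"], "Staging": []}, new_version="4")
print(plan)
```

In practice `apply_plan` would receive an `mlflow.tracking.MlflowClient()` instance. Separating the plan from its application makes the ordering easy to inspect or log before anything changes in the registry. The same client method also accepts `archive_existing_versions=True`, which may be simpler when no custom ordering is needed.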
So I’m gonna turn it over to Pete to kind of walk through the beginnings of the Continuous Deployment part using Azure DevOps.
– Thanks, Mary Grace. So a lot of the steps that Mary Grace was mentioning are for building the release. Nothing has actually been pushed to the target environment yet. This is just preparing the artifact and marking all the code that is going to be eventually deployed. The next step is to actually do the deployment. How often you run this, and what triggers it, really varies based on the requirements of your project. One common way to trigger a deployment is by listening for whenever there has been a commit on the master branch. That might be because somebody has opened some pull requests and they’ve been reviewed. Then that code is merged into the master branch, which creates a commit. Your CI/CD process will recognize that and say, okay, I have new code to be deployed. And then it’ll take that bundle and go through the deployment process. So what we see here in this example is all of the configurations that are required for this deployment. This is really you telling the process: where am I pushing this to? How do I connect to there? There’ll be things like obfuscated usernames and passwords here. But then there are also gonna be configurations in terms of how I wanna execute on this particular environment. As Mary Grace said, you may not always be promoting to dev, QA, staging, and production. Your organization might have other stages in there, like a UAT or UAT 2 or things like this, and each one of those environments might have different configurations. So when you’re configuring your deployment, you’ll wanna make sure that for each stage, or for each target, you have a clear configuration that’s defined for that particular deployment.
– Again, we’ll just continue to walk through what the YAML looks like and what’s happening in the YAML.
So you’ll see that the first step here is all about defining the instance operating system. Then you’ll see the next step there is all about installing Python, so you’re setting your Python version and installing it.
And then after we’ve installed Python, we’re actually gonna install the Databricks CLI. And note that as you install the Databricks CLI that’s gonna be the mechanism that we’re gonna use to actually call the jobs API and interact programmatically with Databricks on the backend.
And then this last chunk here is all about actually configuring the Databricks CLI. So as you can see this is where we referenced one of those environment variables that was defined in the previous slide that Pete was talking about. And you can actually, again, like in this case, we have this secret token hidden. But you have the ability to define environment variables, pass it, whether they’re public or they’re hidden variables. So this is all about kind of the backend setup that’s required to actually kick off some of this automation.
– Okay, so great. In the previous slide Mary Grace was talking about setting up your environment and getting it ready to execute the deployment. In this next step we’re actually gonna do the deployment. You can look at these lines here in this YAML file, and really what this is doing is leveraging the Databricks CLI to do the deployment. It’s taking code from the CI/CD server and pushing it to the desired target, and in this case the desired target is your Databricks workspace. So we’re using the Databricks CLI, which under the covers is calling the Databricks REST API, which has endpoints that do most of the same things that you would do just clicking around your workspace. You can do things like spin up clusters, move notebooks around, different things like this. In this case, we’re using the endpoint that deals specifically with the workspace. You can see here we’re making a directory within our workspace, referencing a variable that was defined previously. And then we’re going to import the code from the CI/CD server, which previous to that was pulled down from the source code repository, and we’re pushing that to the Databricks workspace. This is leveraging some of the configurations that we did a few slides back. So you can see here, I’m not specifying the specific workspace that I’m going to; that’s already been configured. This is saying, use the workspace CLI endpoint and import the entire directory of all the notebooks that we wanna import. Then there are different flags that are put on here: we’re gonna say we’re gonna overwrite everything, and then we’re gonna give it the path to which we want to deploy. And that’s really what the deployment is in this case.
It varies based off of different type of code you have but in this case, it’s moving notebooks from the CI/CD server into the appropriate directories on the work space. – And now that we’ve moved those directories and notebooks to the right place in the workspace, the next step is actually building and spinning up the cluster to run this code against.
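The YAML from the slides is not included in the transcript. As a hedged sketch of the two workspace commands described above, the Python below builds the argument lists a pipeline step might run. It assumes the legacy `databricks-cli` command names (`workspace mkdirs` and `workspace import_dir --overwrite`) and authentication through the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables, which may differ from how the speakers configured the CLI. The host, token, and paths are placeholders.

```python
import os
import subprocess


def cli_env(host, token, base_env=None):
    """Copy the environment and add the variables the CLI reads for auth."""
    env = dict(os.environ if base_env is None else base_env)
    env["DATABRICKS_HOST"] = host
    env["DATABRICKS_TOKEN"] = token  # keep this as a secret pipeline variable
    return env


def deployment_commands(local_dir, workspace_dir):
    """Return the CLI invocations: create the target folder, then import."""
    return [
        ["databricks", "workspace", "mkdirs", workspace_dir],
        ["databricks", "workspace", "import_dir", "--overwrite",
         local_dir, workspace_dir],
    ]


def deploy(local_dir, workspace_dir, env, run=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in deployment_commands(local_dir, workspace_dir):
        run(argv, env=env, check=True)


# Placeholder paths; printing the commands does not contact any workspace.
for argv in deployment_commands("./notebooks", "/Shared/ml-demo"):
    print(" ".join(argv))
```

Calling `deploy("./notebooks", "/Shared/ml-demo", cli_env(host, token))` would execute the commands for real, so it is left out of the sketch. Passing `run` as a parameter makes it possible to dry-run the step in a test.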
So again, this is all pulling from the Databricks clusters API. On the left side there you’re specifying your cluster configuration, so setting up your instance types, your nodes, auto-scaling, all of those potential customizations that need to be made to your cluster. And then on the right side is the code that’s booting up the cluster. What I learned as I was going through this is that you definitely wanna make sure you put that code to sleep a little bit, ’cause it’s gonna take some time to spin up the cluster. So it’s gonna do a few checks for the status, and it’s pending, pending, pending, and then ultimately you’ll see your cluster be spun up.
– Once we’ve gotten to that point, we’ve already done the deployment, so now it’s available on the correct target environment, and now we wanna actually execute it. So we’re gonna do the same thing: we’re going to use that same CLI in order to initiate the execution of that job. Once you do that, there are a lot of different things that you need to pass to it so it can function as expected. You would pass any of the parameters that the job might take in. In addition to that, you might be passing in things that are not specific to the code but describe how you want the job to run. You can see here, max concurrent runs: how many do you want to be able to execute concurrently? Things like timeouts and different things like this. And then you can see here, this says existing cluster ID. We’re pointing it at a cluster ID that we’re assuming has already been spun up, but you can also do things like, I wanna start up an ephemeral cluster. If you are gonna do that, then you’d have to put in all of the JSON that describes all the cluster configurations.
And so really in this step, when you’re actually executing it, it’s all the same things that you would configure manually through the UI, but you have to explicitly define them within JSON. And then we’re gonna pass that JSON again to our CLI in this case the job CLI to actually execute this.
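Neither the polling code nor the job JSON from the slides appears in the transcript. The sketch below illustrates both ideas under stated assumptions: `wait_until_running` sleeps between status checks until a state getter (for example, one wrapping the clusters API) reports `RUNNING`, and `job_settings` assembles a job definition using field names from the Databricks Jobs API 2.0. The job name, notebook path, parameters, cluster ID, and timing values are made up.

```python
import time


def wait_until_running(get_state, poll_seconds=30, timeout_seconds=1200,
                       sleep=time.sleep):
    """Poll until the cluster is RUNNING; return the seconds spent sleeping."""
    waited = 0
    while True:
        state = get_state()
        if state == "RUNNING":
            return waited
        if state in ("TERMINATING", "TERMINATED", "ERROR", "UNKNOWN"):
            raise RuntimeError(f"cluster entered state {state}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still {state} after {waited}s")
        sleep(poll_seconds)  # give the cluster time to come up
        waited += poll_seconds


def job_settings(name, notebook_path, parameters, cluster_id=None,
                 new_cluster=None, max_concurrent_runs=1,
                 timeout_seconds=3600):
    """Build the JSON body for a notebook job on exactly one cluster choice."""
    if (cluster_id is None) == (new_cluster is None):
        raise ValueError("pass exactly one of cluster_id or new_cluster")
    settings = {
        "name": name,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": parameters,
        },
        "max_concurrent_runs": max_concurrent_runs,
        "timeout_seconds": timeout_seconds,
    }
    if cluster_id is not None:
        settings["existing_cluster_id"] = cluster_id  # reuse a running cluster
    else:
        settings["new_cluster"] = new_cluster          # ephemeral job cluster
    return settings


# Simulated status sequence so the sketch runs without a workspace.
states = iter(["PENDING", "PENDING", "RUNNING"])
waited = wait_until_running(lambda: next(states), poll_seconds=5,
                            sleep=lambda seconds: None)
settings = job_settings("batch-inference", "/Shared/ml-demo/inference",
                        {"model_stage": "Production"},
                        cluster_id="0000-000000-example")
print(waited, sorted(settings))
```

Requiring exactly one of `cluster_id` and `new_cluster` mirrors the choice described above between pointing at an existing cluster and describing an ephemeral one in JSON.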
– To kind of, again, take a step back. I know this is moving through things pretty quickly, but let’s think about this process holistically. We’ve developed some code, so we branched from our master branch on whatever version control tool we’re using. We developed some code, and we were able to use MLflow to pick the best model, where we define whatever the best model means. We picked that one and used MLflow to register it. And now we’re using Azure DevOps to set up the environment, set up our environment variables, configure the Databricks CLI, and then interface with the Databricks CLI to automate the running of these notebooks. In this case, again, this instance is just for an inference job, so it’s running some code that’s ultimately running a bunch of batch inference. We didn’t show that specific code here, since this is more about the overall process, but this might look a little bit different if your deployment looks different. If you’re deploying to a cloud native service, or you have to build a container, there are functions that are native to MLflow. There’s a build-docker function, or you can build an Azure ML container that can be shipped to these other cloud serving tools as well. So we took a really basic example, like I said, but as complexity is added here, there are just little changes that need to be made, whether it’s in the ML process, or the YAML code for your DevOps pipeline, or in the actual source code for your ML development. So we’ve defined everything. We’ve defined our clusters. We’ve defined what our job looks like. Now it’s time to actually run the job. That first bullet, in that first snippet of code, uses the run-now endpoint via the jobs API to kick off and run the inference job.
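The run-now snippet on the slide is not reproduced here. As a rough equivalent, the sketch below builds (but does not send) the HTTP request for the Jobs API 2.0 `run-now` endpoint using only the standard library. The workspace URL, token, job ID, and notebook parameter are placeholders, and the talk itself goes through the Databricks CLI rather than raw HTTP.

```python
import json
from urllib.request import Request


def run_now_request(host, token, job_id, notebook_params=None):
    """Build a POST request that triggers an existing job by its ID."""
    body = {"job_id": job_id}
    if notebook_params:
        body["notebook_params"] = notebook_params
    return Request(
        url=f"{host.rstrip('/')}/api/2.0/jobs/run-now",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; nothing is sent over the network in this sketch.
request = run_now_request(
    host="https://example-workspace.invalid/",
    token="dapi-placeholder",
    job_id=42,
    notebook_params={"model_stage": "Production"},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit it, and the JSON response includes a `run_id` that can be used to look the run up later through the jobs API or the UI.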
And so, while this is all happening through this automated deployment process, you can verify it through the Databricks UI. Even though we’ve called this job and run it from the APIs, because we’re using the run-now endpoint it’s still gonna show up in the Databricks UI. So say you’re coming back and you need to troubleshoot the job. You need to get access to the logs, something went wrong, or you just wanna investigate that job a little bit more and look at runtime or cluster utilization. You can do that programmatically, again through the jobs API, but you could also use the Databricks UI to investigate these jobs as well. So while this process is hands-off, you do have the flexibility to jump in, navigate through the UI, and pick up where some of the automation has left off.
– So in summary, the point we really wanna make is that MLOps is important because the end goal of it is being able to deliver your code in an efficient manner, where you’ve ensured that it’s been properly tested and that all of the steps to promote that code have been followed correctly.
So you really get what you’re expecting. And by automating that process, you’re giving feedback to the developers very quickly if there is an issue out there. That’s the importance of the automation aspects of DataOps or MLOps, or just DevOps in general. Specifically when we’re talking about MLOps, a lot of the stages are very similar to what you’d see in a traditional software engineering process. But things that might vary are things like what type of artifacts you might be deploying. Rather than just having website code that you might be pushing up, your artifact would contain things like a serialized model. And then you’d have to worry about things like different ways to serve up that model. So in terms of broad strokes, it’s very similar, but when you get into the specifics there are a few tweaks that you have to make. Hopefully we’ve shown what some of those tweaks are. Another really important point is that these are all things that you would do manually anyway, and what we’re trying to do is build a more efficient process. One of the ways to do that is to use great tooling. MLflow is really great in that it offers you the guardrails that you need. It ensures that you’re picking the right experiment while you’re doing that discovery phase to figure out which version of the model code you actually want to promote to the next environment. And it gives you all those APIs to automate the processes of tagging it for the next environment and eventually building your artifacts. So MLflow is a really great tool to put some governance around your development process and automate the deployment of it as well. And then finally, to orchestrate all of these processes, there are a lot of really great tools out there. In this particular case we used Azure DevOps, but you could use Jenkins or some homegrown ones. Really, DevOps in general is a design pattern. You do the same steps at each stage, but how you actually implement them will vary based on the tools that you’re using. So yeah, Mary Grace and I would like to really thank you for listening. Hopefully you found the topics that we discussed to be useful to you. And we really encourage you to look more into the automation of your development process, ’cause it’ll make it a lot more efficient.
Mary Grace Moesta is currently a Data Science Consultant at Databricks working with our commercial and mid-market customers. As a former data scientist, she worked with Apache Spark on projects focused on machine learning and statistical inference, specifically in the retail / CPG space. With previous research in Markov Chain modeling and infectious disease modeling, she enjoys applying mathematics to real-world problems.
Over the course of his 20+ year career, Pete has fulfilled many roles, including data engineer, web developer, trainer, consultant, customer success engineer and, most recently, tech lead at Databricks. Based in Atlanta, GA, he has delivered and managed projects of varying sizes across multiple verticals, including utilities, financials, higher education and manufacturing. In early 2018 he joined Databricks, where he specializes in topics related to automation for data pipelines.