Getting machine learning models to production is notoriously difficult: it involves multiple teams (data scientists, data and machine learning engineers, operations, …), who often do not speak to each other very well; the model can be trained in one environment but then productionalized in a completely different environment; it is not just about the code, but also about the data (features) and the model itself… At DataSentics, as a machine learning and cloud engineering studio, we see this struggle firsthand – on our internal projects and on our clients’ projects as well.
To address the issue, we decided to build a dedicated MLOps platform, which provides the necessary tooling, automations and standards to speed up the model productionalization process and make it more robust. The central piece of the puzzle is MLflow, the leading open-source model lifecycle management tool, around which we develop additional functionality and integrations with other systems – in our case primarily the Azure ecosystem (e.g. Azure Databricks, Azure DevOps or Azure Container Instances). Our key design goal is to reduce the time spent by everyone involved in the process of model productionalization to just a few minutes.
In this talk, we will discuss:
– How we think about MLOps at DataSentics in general, what our real-life model productionalization experiences are and how they affected the building of our MLOps platform
– Demo of model deployment on our MLOps platform
– Lessons learned and next steps
Speaker: Milan Berka
– So, hi, everyone, it’s a pleasure to welcome you to this session. I’m Milan Berka, I work as a Machine Learning Engineer at DataSentics, and today I want to share with you our experiences and lessons learned from building an MLOps platform around MLflow. So let’s get to it. First, I want to sum up the agenda for this session. First, I will say a few words about DataSentics: who we are, what we do. Then I will touch on motivation: why are we doing it? What are some common analytical problems we encounter and try to solve? And then I really want to deep dive into this MLOps topic: how do we understand it? What does it mean at DataSentics to get a model into production? How does it fit into our platform? And I really want to show you a demo of how this is done. And then last but not least, lessons learned and next steps, which we will touch on after the demo.
And so, a few words about DataSentics. We like to say that we are a machine learning and cloud data engineering boutique. We are based in Prague, and our mission is to really make data science have an impact on organizations, because a lot of times we see that data science is perceived as something of a silver bullet: it’s easy to do, it has instant success. But the reality is, and we see it firsthand with our customers and also on our own internal projects, that it’s very difficult to have an impact with data science. And there is a lot of stuff that must be done in order to make this real. So this is something we’re trying to help with. I should also mention that we are a proud partner of Databricks and Microsoft, and also of this Data + AI Summit. So if you find this interesting, please find us at our booth here, and we can have a talk about it.
All right. So, our motivation. The problems that we commonly encounter, there’s a lot of them.
It’s basically that a lot of times the model gets stuck in the experimentation phase and nobody knows how to put it into production, as I’ve already mentioned. A lot of times it’s also that there are so many tools on the market, and everyone is claiming that theirs will solve your problem, but the reality is quite different: you have to squeeze and tweak them, and that is not really easy. Also, a lot of times we see that data science workloads are not fully understood by software developers and software architects, and this also introduces some sort of friction. So to combat these problems, we assembled a dedicated team, we call ourselves DataSenticts, and our goal is to really make the owners of these problems happy. And whether it’s a data scientist, a data engineer, an engineer or a security guy, we really want them to talk together, to have common tools and frameworks that work together well, and to really introduce this culture and these systems to make data science solutions come to reality. And this is our mission, this is something that we want to do. And on the next slides I will briefly, or sometimes deeply, touch on these.
All right. So first let me introduce some of the tooling that we developed at DataSentics. But before we get into the MLOps part, let me maybe go a little bit earlier in the process and start with something more basic. And that’s having a platform where the scientists and engineers and analysts can come in and they have the data ready, they have an environment where they can do experiments, they can write code, a SQL part, and basically maybe do one or two models, and really have this basic infrastructure and cultural setup. We actually took the best of Azure, including Databricks. And we put together an infrastructure template, which can be taken to a completely new Azure subscription, and with a few clicks
and within a few minutes you can basically unfold this entire template into your environment. And everything is already there. So we have Databricks there, it’s connected to Azure Storage, it’s connected to Key Vault, everything is ready out of the box. You also have the DevOps processes set up. We are using Azure DevOps here, and you can come in and basically start writing code. And all of this takes a few minutes, basically. And the good thing is that we actually took this to our customers, and also to banks. And it works well. And it even went through some penetration tests and security tests, and it stood the tests. So this framework is actually enterprise ready. It’s easy to set up. It’s fast, and you are able to get up and running in a matter of minutes. The best thing, however, is that we are about to open source this, and this is coming soon. I will mention it again at the end of the slides. This is just a teaser, but we are already looking forward to getting this out to you. And if someone wants to get this setup up and running in some optimal way, they can definitely leverage our templates and basically our platform for this.
Okay. And now to the MLOps. So the aforementioned platform, that’s fine, but when you want to get serious and you really want to have tens, hundreds or thousands of models in production running every day, with some SLA, you really need to have a little bit more in terms of the system and platform. So this is now our main focus, and something I now want to deep dive into. And I want to get to the demo as soon as possible. But before this, let me just briefly touch on the parts that are required to make this work. And these parts are as follows. So first we have some data sources. Whether it’s streaming sources or database or warehouse sources and the like, you want to take the data, and as a scientist, you really want to come up with the features. This is where the feature store enters the picture.
It should be one central place, across maybe the entire company or organization, to store your features. And it should be curated. You want to register a feature, and it should be taken care of that the feature is of high quality and can be used further. And then as a data scientist you basically want to run some feature-store get-feature command, you take the features, and you start training your models. And if you come up with a new feature, maybe you can register it back to the feature store. Once you have the model trained, you really want to register it. So this is where the model registry comes into the picture. You want to register the artifacts of the model, but also the metadata around it. Who trained the model? What was the performance? What were the features, and stuff like that. And from there, you can basically take the model, and you can start thinking about productionalizing it or deploying it, or building the serving pipeline or service application which will then answer requests to the model. And here, this can come in different flavors. You can maybe spin up an API, you can set up a streaming job, or you can set up a batch job. And this is actually quite complex, because there’s a sort of hidden complexity here. And this is definitely an interesting topic. And then last but not least, we have monitoring, so you want to make sure that all your models are running. You want to monitor that everything is running okay, and maybe if the world changes, and the data that comes into the model changes, you want to know about it, you want to be notified. And if something goes really sideways, you really want to be able to react to it and maybe retrain the model, or come up with a completely different solution.
All right. Here I have a blueprint of what we will see in the demo. So now, really fast. First, you will see how the experimentation phase begins: the data scientist comes and he wants to create an experiment.
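The roles of the feature store and the model registry described above can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the DataSentics platform or any real feature-store API: the class names `FeatureStore` and `ModelRegistry`, their methods, and all values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureStore:
    """Hypothetical in-memory stand-in for a central, curated feature store."""

    features: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def register_feature(self, name: str, values: Dict[str, float]) -> None:
        # Illustrative curation checks: no empty features, no silent overwrites.
        if not values:
            raise ValueError(f"feature {name!r} has no values")
        if name in self.features:
            raise ValueError(f"feature {name!r} is already registered")
        self.features[name] = dict(values)

    def get_features(self, names: List[str], entity_ids: List[str]) -> List[Dict[str, float]]:
        # One row per entity, holding only the requested features.
        return [{n: self.features[n][e] for n in names} for e in entity_ids]


@dataclass
class ModelRegistry:
    """Hypothetical registry keeping the artifact together with its metadata."""

    versions: Dict[str, List[dict]] = field(default_factory=dict)

    def register_model(self, name: str, artifact: bytes, trained_by: str,
                       metrics: Dict[str, float], feature_names: List[str]) -> int:
        entries = self.versions.setdefault(name, [])
        entries.append({
            "artifact": artifact,
            "trained_by": trained_by,
            "metrics": dict(metrics),
            "features": list(feature_names),
        })
        return len(entries)  # version number, starting at 1


# Usage with made-up placeholder values.
store = FeatureStore()
store.register_feature("monthly_income", {"client_1": 1000.0, "client_2": 2500.0})
rows = store.get_features(["monthly_income"], ["client_2"])

registry = ModelRegistry()
version = registry.register_model(
    "example_model", b"serialized-model", "data_scientist",
    {"accuracy": 0.9}, ["monthly_income"],
)
```

The point of the sketch is the separation of concerns: features are registered once and read by name, while each registered model version carries who trained it, how it performed, and which features it used.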
He wants to do some data analysis. He wants to do some fast experiments. He will log the experiments into the model registry. Then once he is done with this and he wants to get serious, there is this part of upgrading, like fortifying, the code and creating a robust application which can then be deployed as a retraining application. And once we have this continuous training set up, we also want to move further and start building the serving application which will then serve the models. And this is like the end phase, where the other business processes will then call these serving pipelines and get the predictions.
All right guys. So it’s demo time, and in this demo we will be in JupyterLab, we will be here as a data scientist, and as a machine learning engineer later. And you can basically do this from anywhere, from a local computer, from Databricks, but we have opted for JupyterLab. And this JupyterLab is hosted in Azure, yeah. All right. So first, assume that I am a data scientist and my job is to solve some business problem with a machine learning solution. So I come to JupyterLab, and the first thing I want to do is to initiate my project. I can do this manually, or I can use one of the tools we developed, which is called the MLOps CLI. So let me just use the MLOps CLI for this task. All right. And so what I just wrote here is MDA, the MLOps CLI command which stands for model development assistant, and I will just write init, and init will initiate this process of creating the project. So I just fill in some data, and so let’s create a project. All right. So I just initiated it. And right now, what will happen is that, behind the curtains, Cookiecutter will take over and it will create a new folder for me. So I’ll just enter ASAP, just again, it needs a password. Now it should just want me to confirm everything. So let me just confirm everything. Or maybe I can rewrite something if I want. Okay.
So right now we see that we have a new project here, which is called DA. And if I go in, I can see that I already have a template of how this project will look, or how it looks when it’s initiated from the template. So as a data scientist, I really want to start my job, my process, more on the experiment side. So I have a notebook folder here, and my job right now is to open the notebook so I can start writing some experimental code. So I go into this notebook folder, and there is already a Jupyter notebook prepared for me, and this Jupyter notebook is filled with what I just passed into the project creation, so it already has these names and stuff like that initiated, and it already has some pre-packaged, sort of documentation-like code, which I can now run, and I can start doing my work. So let me just restart the kernel and let’s just run these commands. So first is the installation of MLflow. Then we have the initialization of some of the important functions we will use further. And now here I have a dummy toy example with the Iris dataset, and this is really for me to delete and to play around with. For the purpose of this demo, I’ll just keep it there. And let’s just try to run this experiment and log it into MLflow. So maybe let me just change this size to something else. I’ll just save this and I’ll just run it. Okay. So this is the first time I am running this experiment, so it will create a new experiment in MLflow. So if I go over to MLflow, and if I refresh, I should see it. Yeah, I see it, a new experiment, which is called DA. And I already have this first run logged. And when I open it, I see a standard MLflow interface. And yeah, so I have my parameters and metrics, something I defined in the notebook. And then I have this model file with all the necessary artifacts and metadata to run the model. One very good thing actually is that I have my Git commit here.
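What the run shown above records (parameters, metrics, and the Git commit of the code) can be sketched without any tracking server. This is an illustrative stand-in using only the standard library, not the notebook code from the demo: the helper names `current_git_commit` and `build_run_record` and all values are hypothetical. In MLflow itself, parameters and metrics are logged with `mlflow.log_param` and `mlflow.log_metric`, and the commit is stored in the `mlflow.source.git.commit` tag.

```python
import json
import subprocess
import time
from typing import Dict, Optional


def current_git_commit() -> Optional[str]:
    """Return the current Git commit hash, or None outside a Git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or the working directory is not a repository.
        return None
    return result.stdout.strip()


def build_run_record(experiment: str, params: Dict[str, float],
                     metrics: Dict[str, float],
                     git_commit: Optional[str] = None) -> dict:
    """Bundle everything needed to tie a result back to a version of the code."""
    commit = git_commit if git_commit is not None else current_git_commit()
    return {
        "experiment": experiment,
        "start_time": time.time(),
        "params": dict(params),
        "metrics": dict(metrics),
        "tags": {"git_commit": commit},
    }


# Usage with placeholder values; the commit is passed explicitly here so the
# example does not depend on being run inside a repository.
record = build_run_record("DA", {"size": 0.3}, {"score": 0.95}, git_commit="0123abc")
print(json.dumps(record, indent=2))
```

Storing the commit next to the metrics is what makes two runs with different scores comparable: the difference can be traced to a specific change in the code.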
So, right now I can basically connect this run to a certain version of the code under which it was run. And this is very neat for reproducibility. Okay. And now if I maybe go with another setup of the experiment, I can run this again. Let me just save it. This should not really make much difference, so I can go back into MLflow and I can see my new run, and I can see that it has a different Git commit, and it has a different model score actually, which is interesting. Oh, so maybe expected: I just changed the train/test split, and that produced a different model score. All right. So this is how I can start playing.
And right now let’s assume that it’s time to stop playing and start to do some production work. So I actually have two things I want to do. First is to productionalize this machine learning code, the training pipeline, and set up the continuous training. And then I want to maybe think about how to design the serving pipeline. So this is where the machine learning engineer comes in. And together with the data scientist, they figure out how to fill in this source template. And here there are several files which must be filled in. One of them is a train model file, another one is a config. And then we have some tests to run. So basically what I am trying to do here is to take all my experimental code from the notebook and put it into the train model file, which will then be run in production. This will provide continuous training. So if I open this, this is how it looks if I haven’t done anything. So this is again some sort of template. And as a machine learning engineer and a scientist, I am here to fill this in. The same goes for tests. So again, I have this prepared stuff, and I just have to put in these new tests, and this is up to me to design. Okay. So maybe let’s just switch over to a project that already has this filled in. So, we already have this project CRX, which is about credit ratings.
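The move from notebook code to a train model file with a config and tests, as described above, might look roughly like the sketch below. It is a hypothetical layout, not the actual DataSentics template: `TrainConfig`, `train_model`, and the test are invented names, and a trivial majority-class predictor stands in for a real model so that the example runs with the standard library only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Row = Tuple[float, str]  # (feature value, label)


@dataclass(frozen=True)
class TrainConfig:
    """Settings that would live in the config file rather than in the code."""

    test_fraction: float = 0.25


def train_model(rows: Sequence[Row], config: TrainConfig) -> Tuple[Callable[[float], str], float]:
    """Entry point the training pipeline calls; returns the model and its score."""
    cut = int(len(rows) * (1 - config.test_fraction))
    train, test = rows[:cut], rows[cut:]
    if not train or not test:
        raise ValueError("not enough rows for the configured split")

    # Placeholder model: always predict the most common training label.
    majority = Counter(label for _, label in train).most_common(1)[0][0]

    def model(feature: float) -> str:
        return majority

    accuracy = sum(model(x) == y for x, y in test) / len(test)
    return model, accuracy


def test_train_model_returns_valid_score() -> None:
    # A test of the kind the template leaves for the team to design.
    rows = [(float(i), "good" if i % 4 else "bad") for i in range(20)]
    model, score = train_model(rows, TrainConfig())
    assert model(1.0) in {"good", "bad"}
    assert 0.0 <= score <= 1.0


test_train_model_returns_valid_score()
```

Keeping the entry point a plain function with an explicit config makes it callable both from a notebook during experimentation and from an automated pipeline after a push.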
And if we go here into the source folder and into the model folder, we already see that these things are filled in. So the machine learning engineer came in, and together with the scientists they actually did their job and converted everything from the notebook into this train model file. So this particular case is about creating a pipeline, the model that will score a client based on his credit ratings. So we have some pipeline steps here, and then we have this train function filled in, and everything is called within this train model function, which is then called when the training is actually happening. And also, I believe that in this particular example, the tests are as they were. So we have these dummy tests, but if we need to, we can definitely bring in some more tests. And maybe the last thing is this MLOps folder, which is to be filled in, again, by the machine learning operations engineer. And here they have the training pipeline and a deployment pipeline, which in our case are built on Azure DevOps. But it is not a very big problem to port this to GitLab or another automation tool. And this particular pipeline is doing the training of the model based on the Git push. So if I push new code into the repository where the project resides, I can retrain this model. And then we have this deployment, and again, on the push, things will be automatically deployed and ready for a test or ready for production. So let’s actually do this. Let me just switch. So I just switched into the CRX folder, where Git is already connected to the remote. And right now, what I can do is change something in the model, some parameters of the model. So I can see that this model is utilizing a random forest classifier. So let me just change this value. And this can be done by data scientists. So we have everything prepared.
I have the training pipeline set up, and all I need to do is basically to change the code and commit and push, and the training pipeline will just fire automatically. So this is like the next stage: the experiment was playing around, and right now I have this file I can alter, commit, and push to Azure DevOps, and the retraining will be done automatically. So let’s do it. I have changed this parameter, and now let me just push it to Git. All right. And we are pushed. So right now, what will happen is that in Azure DevOps there’s already this retrain pipeline, which is now reacting to this Git push. And it’s basically running the retraining process which is defined in the MLOps folder. And if we click into the details, I can see that it is going through the build stage. It’s basically taking the artifacts from the Git repo. It’s preparing the environment, and the training will soon begin. And maybe one very nice thing about this is that the training is actually running in DevOps itself. So it’s running on the machine where DevOps is actually running, but we can set this up to run on any platform that you want. So whether it’s a Kubernetes cluster on-prem, or whether it’s Azure ML, or Databricks, or some other service, this can definitely be set up. Okay. So we are in the process of retraining, right now we are preparing the training environment, and soon the training will begin. And after we are done with the training, what will happen is that the deploy pipeline will run automatically. It will take the new model, and it will actually spin up a new ACI, which stands for Azure Container Instances. It will create a new endpoint, and I can directly query it. So this is something that will follow the training. So this is really where the automation is at its strongest. All right. Yeah. So we are installing, right now we are running the training, hopefully it will run; if not, it will be a failed run in MLflow.
But so far we are generating scripts, and it seems that we will get this model done. All right. Yeah. Great. And we are done here. And we are actually running the tests right afterwards. So we can see whether the tests passed, and if something goes wrong, this pipeline will shut down, and we will not have a new model if the experiment doesn’t pass the tests. Okay. So if I go back to MLflow and I refresh my page, I should see a new run. It’s here, and we see that it was just created, and it’s on this Git commit. And again, we have the new model. So this was the training. So this is the new way a data scientist can retrain: it’s just based on a commit to this source folder. And if we go back, we see that the deployment pipeline to the serving environment is running also. So if I click here, I can go to the stage, and I can see that the release is happening. And again, there are some steps that we have to go through. We have to download the model. Then we have to deploy the ACI, which stands again for Azure Container Instances. And once this is ready, we will see that we can actually access the endpoint and query it directly. And this is all happening automatically. So for me, as a data scientist, I have one commit, and everything else is happening automatically, with a little work from the machine learning engineer to set these pipelines up, and we are here to help them with the templates. Oh yeah. So right now we are deploying to ACI. This might take a while, but once we are deployed, we can actually leverage the new endpoint. Great. And we are done. So right now, let me just grab the new ACI URI, and let me just clear this one up. So we have all these files open, and here there is already a prepared POST request. So let me just replace the URI. And right now I can basically, well, query the endpoint and I get the result.
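The prepared POST request against the new endpoint could be sketched as follows, using only the standard library. The URL, the `{"data": [...]}` payload shape, and the feature name are assumptions made for illustration; the actual scoring schema of the deployed service is not shown in the talk.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_scoring_request(endpoint_url: str,
                          records: List[Dict[str, Any]]) -> urllib.request.Request:
    """Prepare a JSON POST request for a model-serving endpoint."""
    body = json.dumps({"data": records}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(endpoint_url: str, records: List[Dict[str, Any]], timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response (needs a live endpoint)."""
    request = build_scoring_request(endpoint_url, records)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical URI; in the demo this is replaced with the URI of the new ACI.
request = build_scoring_request(
    "http://example.invalid/score",
    [{"feature_a": 1.0}],
)
```

Only the URI changes between deployments, which is why replacing it in a prepared request is enough to query the freshly deployed model.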
So I have a new model endpoint in production, fairly easily and fast, and friendly for everyone involved, hopefully. All right. So here we are. We are at the end of the deployment. And the very last thing which I want to mention is that this is not really the end; there’s a next step, and this is monitoring. And here our approach is that we built a monitoring library, and this library should really work quite independently of anything else. This is a library that can be run in Databricks or in a special Docker container. And basically this library is about connecting to a data source which represents the output of the model. Then it has predefined metrics which it can compute. And here we are computing the skew and the drift of the model, or more precisely of the features. And then it can basically push these metrics to a target of your liking, whether it’s MLflow metrics, or some Power BI report, or just a plot in your notebook. This, again, depends on what the preference is. So if I go here, I can drill down into these. I can run this. And it’s calculating the mean drift and skew. And here I have, yeah, so this is from four days ago, we haven’t updated the dataset, but basically here I have model skew and model drift historically. Now, if I want, I can definitely plot this and I can see how it behaves. And if it exceeds a certain threshold, I can react to it, either by sending some alert messages or by automatically retraining my model.
Okay. So this is the demo. Let’s maybe just debrief what we saw, in this picture. It’s actually broken into parts. So we saw the model development assistant, creating the project, creating the templates. Then we saw the open source MLflow as the registry. And then there were the serving parts, the serving manager, which consists of the deployment pipelines and also the means to deploy the actual deployment pipelines. And then we have this monitoring and alerting.
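The mean drift and skew computation with a threshold check can be illustrated with a short sketch. This is not the DataSentics monitoring library: the function name, the threshold, and the sample values are hypothetical, and the definitions are assumptions, with "skew" taken as the shift between training data and current serving data and "drift" as the shift between an earlier and a later serving window, both measured on a feature’s mean in units of the reference standard deviation.

```python
from statistics import mean, pstdev
from typing import Sequence


def mean_shift(reference: Sequence[float], current: Sequence[float]) -> float:
    """Absolute difference of means, scaled by the reference standard deviation."""
    spread = pstdev(reference)
    if spread == 0:
        raise ValueError("reference sample has zero variance")
    return abs(mean(current) - mean(reference)) / spread


# Made-up values for one feature.
training = [1.0, 2.0, 3.0, 4.0, 5.0]    # data the model was trained on
last_week = [1.5, 2.5, 3.5, 4.5, 5.5]   # earlier serving window
this_week = [4.0, 5.0, 6.0, 7.0, 8.0]   # latest serving window

skew = mean_shift(training, this_week)    # training vs. serving
drift = mean_shift(last_week, this_week)  # serving over time

THRESHOLD = 1.0  # hypothetical alerting threshold
needs_attention = skew > THRESHOLD or drift > THRESHOLD
if needs_attention:
    print(f"alert: skew={skew:.2f}, drift={drift:.2f} exceed {THRESHOLD}")
```

The resulting numbers are plain metrics, so they can be pushed to whichever target is preferred, and the threshold check is the hook for sending an alert or triggering retraining.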
So model drift and skew monitoring, and alerting if something reaches a threshold.
There are a lot of lessons learned that came from this. But one very neat thing in particular was about MLflow. So the open source MLflow is, I would say, quite naked. And you really have to reinforce it with authorization mechanisms. It is also somewhat surprising, but the artifacts actually don’t go through MLflow itself: you upload them, maybe via the MLflow client, but directly into the storage underneath MLflow. So there are a lot of lessons learned, I tried to summarize them in the slides, and there are many more, but these MLflow things are of particular interest. And the next steps: we just want to push this further. We want to introduce some common interface front end, maybe even tighter integration with feature stores, and richer deployment offerings. So, always on the run. As I mentioned, and I want to highlight it again, we are open sourcing the infrastructure part of the basic platform, and we are really looking forward to any feedback on this, and to any companies that want to go along with us, challenge it and contribute to it. So it should be open source soon. Now, the very last thing is to say kudos to the Platform team at DataSentics. And I just want to say that this is the work of many people, it’s not easy. And we are always looking forward to welcoming new members of the team. So this is a call. Great. So thank you very much. I hope you enjoy the rest of the Data + AI Summit, and I am looking forward to any feedback.
Milan Berka is an ML architect at DataSentics a.s. After he finished his college degree in mathematics and stochastics, he started pursuing a career as a data scientist. However, it soon became clear that without a proper data infrastructure and data engineering element, it is very difficult to make a lasting impact with any data science model, regardless of how great the model itself is. Therefore, almost four years ago, he jumped over to the "more engineering side" and started building experience in cloud infrastructure, big data frameworks, DevOps practices and other engineering topics. Combining machine learning and engineering knowledge, his primary focus now is designing and building solutions which ease or even enable the productionalization of machine learning models (MLOps).