Consolidating MLOps at One of Europe’s Biggest Airports

May 27, 2021 11:00 AM (PT)


At Schiphol airport we run a lot of mission critical machine learning models in production, ranging from models that predict passenger flow to computer vision models that analyze what is happening around the aircraft. Especially now in times of Covid it is paramount for us to be able to quickly iterate on these models by implementing new features, retraining them to match the new dynamics and above all to monitor them actively to see if they still fit the current state of affairs.

To meet those needs we rely on MLFlow, which we have also integrated with many of our other systems. We have written Airflow operators for MLFlow to ease the retraining of our models, integrated MLFlow deeply with our CI pipelines, and integrated it with our model monitoring tooling.

In this talk we will take you through the way we rely on MLFlow and how that enables us to release (sometimes) multiple versions of a model per week in a controlled fashion. With this set-up we are achieving the same benefits and speed as you have with a traditional software CI pipeline.

In this session watch:
Floris Hoogenboom, Data Scientist, Royal Schiphol Group
Sebastiaan Grasdijk, Senior Data Scientist, Royal Schiphol Group



Floris Hoogenbo…: Well, nice to have you here at our talk, Consolidating MLOps at Schiphol Airport. Before we begin, let me introduce myself. I am Floris Hoogenboom, I’m the Lead Data Scientist of Schiphol Airport and I’ll be presenting to you today, together with Sebastiaan, maybe you can introduce yourself, Sebastiaan?

Sebastiaan Gras…: Yeah, sure. Hi, everyone, my name is Sebastiaan, I’m also a data scientist at Schiphol and I’m in one of the teams with Floris. Thanks.

Floris Hoogenbo…: For today, we have a deep-dive session prepared for you on how we at Schiphol Airport consolidate MLOps in all of our machine learning projects. And, although it is a deep-dive session, we won’t be covering all concepts from the ground up, rather we will show how we use MLFlow, show how we use Databricks for that, but we won’t go into explaining what MLFlow actually is and what it all can do for you. So we presume some prior knowledge to MLFlow, even though if you don’t have that, I think you can follow along. But we’ll also show some concrete examples, some code, of how we use MLFlow, so it is beneficial in that sense to have that.
Before we kick off and dive into the content, I would briefly like to take a moment to give you a bit of background on what we’re actually doing at Schiphol. You may be familiar with Schiphol. Schiphol is a large airport in Europe. In fact, before COVID hit, it was the third-largest airport in Europe in terms of passenger numbers, and if you have flown to Asia or to Australia, or somewhere else around the world, then you might have transferred at Schiphol. In 2019, before COVID, we handled around 72 million passengers, which is of course quite a bit, also compared to other airports.
What you might not know is that Schiphol is not only the airport we have in Amsterdam. The Schiphol Group is far larger and also manages airports in, for example, Australia. We manage part of JFK, Terminal 4, so there are other airports around the world where we take the knowledge we have from Schiphol and apply it to airport management as well. One thing that makes the Amsterdam airport special is that it’s the oldest airport worldwide that is still in the same location where it was founded. More than 100 years ago, Schiphol was founded as a military airfield, and it is situated close to the city of Amsterdam. That comes with a challenge, of course, because you might imagine that 100 years ago we didn’t expect to be handling 72 million passengers at our busiest times. That means that there is not a lot of room for expansion at Schiphol, but still, we’re faced with capacity constraints: we need to handle more aircraft, and we need to handle them more efficiently. That is, of course, when you have a lot of passengers. You can imagine that passenger numbers have declined quite a bit because of COVID, but now we’re faced with another challenge: “How do we handle a lot of people in a safe manner across our terminal?”
As an answer to that capacity question, Schiphol, a few years ago, started what they call ‘Schiphol Digital’. Schiphol Digital was an effort to use digital technology to make Schiphol more efficient, safer, and a better experience for passengers. And as you can imagine, data science and machine learning are a very important part of making that transition happen.
So what I would like to do before we dive into the content is give you a bit of a picture of what things we’re actually working on. Maybe the most obvious thing to start with is that, as an airport, we’re dealing with operations. We’re dealing with aircraft coming in, we’re dealing with a lot of different airlines that come in from all over the world, and one of the things we do, for example, is try to predict block times, which are essentially the times that an aircraft docks at our gates and starts unloading cargo or disembarking passengers, and we also predict when that aircraft will be gone again. This is, of course, a very fundamental process around the airport, so we take information like, “okay, what is in that list of flight plans, but also where is that flight situated now, for example, based on radar data, what do we know about weather conditions, which other flights have come in,” and we try to make such predictions.
And one of the things we use that for, or we are looking into using it for, is for example optimizing gate planning. We’re doing gate planning dynamically, because you can imagine if you know at a certain moment that a flight will come in for a certain gate but you also see that our predicted departure distribution for the flight that’s currently at that gate crosses the point where we predict the next flight’s coming in, then you might want to take action. So we’re also looking at, “okay, how can we make those predictions actionable, and not only provide the insight, but really act on it to make our operations better?”
Another thing we’re doing in the operations domain is related to passenger flow. You might not experience this as a passenger, but actually our airport, and this is also what you see on the picture there, is a sort of complicated system that you flow through as a passenger, in that sense like water flowing through a set of pipes, with a set of nodes connecting them. One of the things we do is predict based on the current occupancy, where we have infrared sensors all throughout the terminal to determine, “okay, how busy is it in a certain area right now?” We try to predict, “okay, how many passengers will there be in the coming hours?” And based on that, again, we can take actions: we can, for example, route flights to other piers, or we can decide to send passengers through a different arrival filter, so a different border control post, basically. All to prevent crowds appearing, and especially in the most recent times this has of course been more important than ever, since this also allows us to do some form of crowd control, to ensure that everybody can keep a safe distance from one another.
We do things for our passengers as well. So far this was really operations-related, but we also try to provide insight and a better experience to our passengers. One of the things you see here, for example, on the left screenshot, is part of the Schiphol app, where we use wi-fi signals to predict these kinds of heat maps for passengers, where we try to indicate, “okay, where is it busy in the terminal, where is it quiet, where should you go if you want to have a quiet place,” and this is, again, something that was sparked in the last year because of COVID. Another thing we do, which is maybe not necessarily passenger-related, and that’s the right screenshot, is for people living around the airport. Like I said, Schiphol is situated near a city, and that also means that there’s of course quite a bit of disturbance from aircraft. We try to predict, “okay, how many flights can you expect overhead?” So based on the flight patterns and the routes they fly, we provide this insight where you give us your location and we give you a prediction for the coming 48 hours of the number of flights you might expect overhead. By that, we hope to give you a bit of insight into knowing, “okay, you can keep your bedroom window open at night” or “you can expect it to be a bit noisy”.
And lastly, one thing we do is maybe more on the data-gathering side of things, actually. We try to create insight into a lot of the processes we have around the airport. And by that, I don’t necessarily mean insight by analyzing data, but also by thinking of innovative ways to collect data from complicated processes. So one of the things that happens at our airport is of course that a flight comes in and docks at the gate, but at that point it needs to be cleaned, it needs to be refueled, some technical maintenance might need to be done, and passengers should board.
And we have this, of course, not for one airline, but for many different airlines and many handlers, so parties who basically take care that these actions are done around the aircraft. And what we do, instead of having all kinds of API integrations with all those different parties, which you can imagine becomes really complex quite rapidly, is use camera images and computer vision to detect what’s currently going on around the aircraft. By that, we generate timestamps of all those processes, and that’s also the picture that you see in the top right of this slide. We generate a lot of timestamps on when those processes started and when they ended, also with the goal of eventually providing a sort of benchmark and maybe also providing an operational prediction, or even improving that model, for “okay, when will this flight be ready to depart again from this gate so that we can use the gate again and optimize gate planning.”
So a lot of things, a lot of different things. A lot of interconnected things. And as you can imagine, these are all use-cases in which machine learning or predictive modeling is a very important aspect.
I think I’ll hand it over to you then, Sebas, to take us into the MLFlow part.

Sebastiaan Gras…: Yeah, thank you very much, Floris. So, as Floris mentioned, we’ll move a little bit more in depth into what we’re actually doing on the technical side of things. Our goal today is to show how we implemented machine learning operations and how that enables us to keep applying machine learning in a constantly changing environment, because, as Floris hopefully showcased a little bit, it’s quite a dynamic environment we’re dealing with at Schiphol. So, first we’re going to give you a little bit of background on the motivation for why we needed MLOps. Then, we’ll head on to how our training is set up and how we use MLFlow and Databricks for that. Next, we’ll go to, “okay, we have a model, but how do we actually use it in production and how do we get it to production?” And then, we’ll close off with, “okay, we have a model in production, but how do we monitor that model and make sure that it’s working as we expected?”
So on the motivation side, as mentioned, Schiphol’s really a very dynamic place, where we apply Machine Learning in. So just about every day, some physical aspect of the airport changes, which can mean that the dynamics of whatever we’re trying to predict will be different. The example that’s given here is the PAX flow, so Floris talked about passengers flowing through the airport on the terminal side, but what can happen is that, for example one of the hallways connecting two lounges is put into maintenance, that means that they can’t really move any more from that place to the next, or an entire lounge is put into maintenance and that would mean that maybe models surrounding that area will use data that’s no longer working or relevant.
So most things we can actually capture in our models but there are still quite a number of things that we are not able to. So those maintenance things, for example, we don’t always know when they will happen, but you can also imagine that there might be some long-term incident that happens that we can’t really foresee and we have to adapt to, to stay helpful and be able to offer insights for passengers and the operation.
So keeping track of and monitoring models in production was already a big task, due to those ever-changing circumstances, and coupled with that, we also quite often release model updates because these changes occur. For example, we want to incorporate new data sources, or parts of data sources are taken out, once again, for example, those lounges. We want to make sure the data we put into our models is as up-to-date as it can be, not just on the training side, but especially on the inference side.
But maybe a little bit of a step back and first talk about those long-term incidents that can happen. Of course, I think the big one on all of our minds in the recent time was COVID. Here you see an example of what that meant for Schiphol. What you’re actually seeing is one of our runways, Aalsmeerbaan, as you can see by “36R”, and it’s being used as a parking lot for KLM aircraft, which is really something that’s just about unheard of. And what that meant for us and also for the airport, of course, is that really the entire situation changed overnight and just about all of the historic data that we had no longer really applied to what we were trying to predict. Because the number of flights dropped, there were a lot fewer passengers than normally, so those dynamics really changed, and not just on the data side. The physical aspects of the airport also changed quite a bit, so Schiphol went back to what they call “Core Schiphol” where they really looked at what parts do we still want to keep functional to handle our demands for the airlines and quite a number of piers, for example were also put in kind of a storage mode for aircraft, where they park the aircraft but also moved up a lot of maintenance to do the maintenance that they were already planned on doing, but they were able to do faster. So quite a number of changes all of a sudden.
And that then falls back to, “Okay, but how do we actually deal with that in our models?” So our training setup, using MLFlow, is I think really quite standard. On the right, you see a few examples of what we have, and I’ll talk a little bit about those shortly. In general, we have quite a strict format for all of our models. We have a Python package that contains our library code, and that includes the training application as well as the inference application, so nicely bundled together. Training a model then just entails installing the package and using the fixed entry points that we have in an MLproject file to run your models, and we try to keep those entry points the same everywhere. So on the right is a bit of an example, hopefully nothing too wild: you have your entry points, you define a number of parameters, and then you have a number of commands that you use to run your models to train them. What then happens is that these get stored in MLFlow, along with a number of artifacts that we have created, like the one you see at the bottom right. That’s a Plotly plot of our predictions, to make sure that our data scientists are looking at the same source of truth when we want to evaluate our models.
So it may be interesting to go a little bit more into depth on that training setup. Let’s focus first on the purple square that you see. What we do is we have a custom MLFlow run script, so that’s at the bottom left. It’s a bash script where we do our MLFlow run, and what that MLFlow run then does is grab a version of our Git repo and download it, and then, along with the parameters that we have for the entry point we use, these get uploaded to Databricks, where a cluster is spun up for the job we’re doing. So it’s just a training job that gets done, and all of that happens in the cloud, so not locally. Once the training job is done, we store the models, the metrics, and the artifacts all in MLFlow, and the cluster is then turned off again. So that’s really quite a neat setup, and quite easy, because it means that we don’t really have any dependencies on local users or on changes that they might have accidentally made in their local code. It’s all based on Git, and that also makes it really nicely reproducible.
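The bash script on the slide is not reproduced in this transcript. As a rough Python equivalent, a minimal sketch of launching such a Git-pinned training run on Databricks could look like the following. The repository URL, commit hash, entry-point name, parameter and cluster spec file name are all hypothetical placeholders; `mlflow.projects.run` with `backend="databricks"` is the MLFlow API for submitting a project this way, but how Schiphol’s own script invokes it is an assumption.

```python
from typing import Any, Dict


def build_run_kwargs(repo_url: str, commit: str, entry_point: str,
                     parameters: Dict[str, Any],
                     cluster_spec: str = "cluster.json") -> Dict[str, Any]:
    """Collect the arguments for an MLflow project run on Databricks.

    All values passed in here are illustrative placeholders.
    """
    return {
        "uri": repo_url,                 # Git repository holding the MLproject file
        "version": commit,               # pin the run to one commit for reproducibility
        "entry_point": entry_point,      # one of the fixed entry points
        "parameters": parameters,
        "backend": "databricks",         # run remotely, not on the local machine
        "backend_config": cluster_spec,  # JSON file describing the job cluster
        "synchronous": True,             # block until the training job finishes
    }


def launch_training(**kwargs: Any):
    """Submit the run; needs mlflow installed and Databricks credentials."""
    import mlflow.projects  # imported lazily so the helper above works without mlflow

    return mlflow.projects.run(**kwargs)


run_kwargs = build_run_kwargs(
    repo_url="https://example.com/schiphol/passenger-model.git",  # hypothetical
    commit="3f2a9c1",                                             # hypothetical
    entry_point="train",                                          # hypothetical
    parameters={"horizon_hours": 4},                              # hypothetical
)
# launch_training(**run_kwargs)  # enable where Databricks access is configured
```

Because the run is pinned to a commit rather than to a local working copy, uncommitted local changes cannot leak into the trained model, which is the reproducibility property described above.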
So that’s, I think, the coding side, but one important thing to note, as our smiley tells us, is that there is no DEV in machine learning. Because what we’re actually doing, of course, is machine learning: predicting with models. And that deals with data. And what we, as data scientists, think is the most important data to use is production data, because that’s the data we will actually run our models on. There are quite a number of organizations that enforce a DTAP flow for data scientists, so they work on DEV and train their models under DEV requirements. And like the square on the right shows, that can work if it’s just a DEV environment for the data scientists and not the DEV environment of the data engineers, because with the data engineers’ DEV requirements, the data schema might change, there might be a lot more noise in the data, or it might still be more of an experimental setup. But if we’re actually using the production data, then we know at least that the data we’re using for our own models reflects what we’re going to predict with. So that’s why we have a separate, dedicated data science workspace, which you see at the bottom left in the square, which has a read-only copy of the data from production. That way we have the correct information.
And then we have a model and we want to run inference. There are a number of types of models that we deploy, specifically batch, streaming, and some event-based request/reply. To give you an example, batch predictions include the block time predictions that Floris talked about earlier, so when is a flight going to be at the gate or leave the gate. That’s a job that just runs every 15 minutes and does its predictions with the last known information. We also have some streaming predictions, such as what the baggage time on the belt is going to be. And then lastly, we have a request/reply kind of thing. That’s something that Floris talked about for predicting the noise at your location. That’s an API in Kubernetes, as an example.
What’s nice about this is our way of integrating the models, for each of these deployments, is more or less the same, so we don’t really need a lot of custom work for those things. But mostly, for the rest of this talk, we’re going to focus on Batch predictions, the scheduled Databricks jobs.
Here we see a simple MLFlow workflow, as MLFlow wants you to work. That’s a mouthful, MLFlow workflow, but what you see at the top left is the registered models in the MLFlow registry, and that should, at least ideally, be the basis you work from. So what would happen ideally is that you take a model and register it in an MLFlow model registry, and then, depending on what workspace you’re working in, you pull the model for that environment and work with that. So, we have a dedicated workspace, you do your training there, you register the model in the model registry, and then if you want to use it in development, you get your untagged model from the registry; if you want to use it in acceptance, you get the staging version; and for production, you get the production version. To give an idea, something like what’s on the right is what you would use. So with mlflow.sklearn, you load your model, and then you do your predictions with it.
On the face of it, this seems like a nice flow to work with; however, there are a number of pitfalls relating to this way of working. Floris is going to talk to you about what those pitfalls are, and more importantly, what we did to resolve them.
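The script on the slide is not included in the transcript. A minimal sketch of the registry-driven pattern it describes, assuming a hypothetical model name `passenger_flow` and the registry stage names of the time (`None`, `Staging`, `Production`), could look like this. The `models:/<name>/<stage>` URI and `mlflow.sklearn.load_model` are existing MLFlow features; the environment names and helper functions are illustrative only.

```python
# Mapping from deployment environment to model registry stage, as described
# in the talk: development uses the untagged version (the "None" stage).
STAGE_FOR_ENVIRONMENT = {
    "development": "None",
    "acceptance": "Staging",
    "production": "Production",
}


def registry_uri(model_name: str, environment: str) -> str:
    """Build a `models:/<name>/<stage>` URI for the given environment."""
    stage = STAGE_FOR_ENVIRONMENT[environment]
    return f"models:/{model_name}/{stage}"


def predict_with_registered_model(model_name: str, environment: str, data):
    """Fetch the model at prediction time and score `data` with it.

    Nothing here checks that `data` was produced by pre-processing code
    that matches the fetched model version.
    """
    import mlflow.sklearn  # lazy import: needs mlflow and registry access

    model = mlflow.sklearn.load_model(registry_uri(model_name, environment))
    return model.predict(data)


print(registry_uri("passenger_flow", "acceptance"))
```

Note that the model is resolved at the moment the job runs, so whatever version currently sits in that stage is what gets used.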

Floris Hoogenbo…: Yeah, thank you, Sebas. Yeah, it seems so nice, yeah, having just this simple MLFlow script, and then being able to deploy your models like that. And it’s not that this doesn’t work, it does work, but it’s a bit of an oversimplification, and there are a few things that you might not realize at first when you think ‘hey let’s deploy those models like this’. And what I want to do, I want to go through them and then we’ll do a deep-dive on the most important one, but more importantly, later on, we’ll also show you how we resolved those issues and overcome those pitfalls, basically.
So the first thing you might notice from the typical MLFlow diagram, as the MLFlow documentation presents it and as Sebas presented it on this last slide, is that there’s a sort of cross-environment dependency. I have the data science environment, or at least the data science model registry, which you might see as being in front, and then you also have the dev and acceptance environments reading from that. Depending on the company and the type of industry you work in, this might or might not be an issue for your security department. Personally, I find it a bit of a non-issue. We all have package registries that also cross those environment boundaries, but it is something to realize. It’s also important to note here that, for a few months now, Databricks has actively supported a shared model registry across workspaces, so you can do this, and it is a feature of the Databricks product. So that’s the first point.
The second point, which I already find more interesting, is that this comes with a lot of runtime dependencies. That means that if you see this in the setting of a batch job, the moment when we fetch the model is the moment when we’re making predictions, and of course all kinds of things may go wrong when you’re fetching those models, yes? That centralized model registry might not be available, or something might have changed in your centralized model registry which leads to those models not being compatible anymore, and this leads to errors. So these runtime dependencies, again, are not a big issue, but they’re really something to think about and to realize.
And then we get to the last two points. The last two points are really the big things which make us go for a bit of a different workflow than the typical MLFlow workflow. The first is the sort of stability assumptions that this little script makes about your model and your code base. If you look at that little script, at line 10, you see a sort of sneaky variable called “data”, and that might sound innocent, but that’s of course something that needs to fit your model and needs to be generated in a way that matches your model, and that’s where the model registry really runs into problems with this setup. I’ll go through that a bit deeper on my next slide.
Before that, let me state the last point. Something that we identified here is that this also leads to a sort of non-atomic deployment, where we don’t have one source of truth. Rather, there’s something in the model registry that’s determining what’s running, and there’s something in Git that’s determining what the prediction job is that’s actually running, and that is hard to reason about.
So before we jump to the solution, because we want to provide you with a solution, and it’s definitely our take that MLFlow is a great product, but this is really something to think about. Let’s dive a bit deeper on the last two points.
So, that stability assumption on your code base. What’s really important to realize is that there are sort of two definitions of what a model actually is. As a data scientist, you might see your model in a very narrow sense, as being only the algorithm, so something that accepts some input and produces some output, but you might also see the model as being a software system that makes predictions. There’s a discrepancy between those two, because especially if you take the narrow view of the model, so only that algorithm, then that is, yes, an algorithm, but one that has a very strict specification, a very strict API.
And that is not only in terms of “I expect my inputs to be a data frame with these columns”, which on the one hand is part of the API, and we’re quite used to that kind of API in software engineering, where you have a set of arguments you need to pass. In this sense you might see columns as arguments, so that’s not really the issue. But those models also come with sort of implicit specifications in the API: how should the features that go in look, and what are the allowed features that go in? That means that there’s a very tight coupling between the model that you deploy, so really the trained artifact that you produce as a data scientist, and the inference code that that model is compatible with. You cannot decouple those two, so you cannot say, “I run every version of my model with any version of my inference code”; no, those two need to be compatible. I always like to call it feature compatibility, which means that the features you generate at inference time can be put into the model and match what it was trained on.
But if you look at the script you see here, for all I know, that model in the model registry might have changed something in the pre-processing steps that lead up to that variable “data”, which I then would also need to change in my prediction flow. And that is really something that this script forgoes. So that’s really something to realize: deploying models is much more than only deploying the algorithm, it’s also about that feature compatibility.
And what does this mean in practice? What might happen is that I have a data scientist who generates a new release that doesn’t use a few features, and I have an older release that did use those features. With that new release, the data scientist might have to align with the engineers who write the inference code, and they might say, “well, I have my inference code updated”. But then at a certain point, if you want to revert, somebody might think, “well, I see that model registry here, I see version 14, let’s revert to version 14,” and then you run into problems. Because the inference code is actually bound to the newer version 16, and you want to revert to version 14. In this case it breaks transparently, because those features were dropped, so it will error, but it might also fail silently, in the sense that it just produces predictions, but those predictions are bogus. So this is really something to keep in mind, and in that sense, this makes the script and the model registry UI a bit dangerous if you don’t think about that.
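To make the revert scenario concrete, here is a minimal sketch of a feature compatibility check. The feature names and the two version lists are invented for illustration, and the check itself is not something the talk describes; it only catches the “transparent” failure (mismatched column names), not the silent one where a column keeps its name but its pre-processing changed.

```python
from typing import Iterable, List


class FeatureMismatchError(ValueError):
    """Raised when inference features do not match what the model expects."""


def check_feature_compatibility(expected: Iterable[str],
                                produced: Iterable[str]) -> List[str]:
    """Return the expected column order, or raise if the sets differ."""
    expected_list = list(expected)
    produced_set = set(produced)
    missing = [name for name in expected_list if name not in produced_set]
    unexpected = sorted(produced_set - set(expected_list))
    if missing or unexpected:
        raise FeatureMismatchError(f"missing={missing}, unexpected={unexpected}")
    return expected_list  # column order the model was trained with


# Hypothetical feature lists: model version 14 was trained with a feature
# that the inference code written for version 16 no longer generates.
V14_FEATURES = ["scheduled_arrivals", "hour_of_day", "wifi_count"]
V16_INFERENCE_OUTPUT = ["scheduled_arrivals", "hour_of_day"]

try:
    check_feature_compatibility(V14_FEATURES, V16_INFERENCE_OUTPUT)
except FeatureMismatchError as error:
    print(f"revert to version 14 would break: {error}")
```

A check like this is a safety net at best. The silent failure mode is the reason the talk goes on to version the model and the inference code together instead.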
The second thing I mentioned, and it’s related to this point, is that this presents you with a sort of non-atomic deployment, because you have a version in the MLFlow model registry, you have a version of your inference code base, and together they determine what’s actually the output of the inference job, like you see here. That means that there’s no single point where you can say, “Okay, I want to go back to this version.” There are always two points, the inference code base and the MLFlow model registry, that you need to consult to know, “Okay, what’s the output that’s being deployed in that inference job?” And that’s really something to keep in mind. It’s something we didn’t like for our setup, so we thought, “Okay, how can we solve this in a different way?” And basically, what we came up with is that we wanted to go for a setup that puts our CI, so just our normal software deployment flow, in the lead, and makes Git really the single source of truth for anything we do related to deploying machine learning models.
What I would like to do is take you through this flow and explain how we do this and how it addresses those pitfalls we’ve identified.
So let’s take a hypothetical example: there’s a data scientist who wants to make a change to, for example, the passenger model. What he does is adapt the code base to train his model, store those changes in Git, and then kick off an MLFlow run from Git. So this is exactly the same as the training setup that Sebastiaan already presented to you. Then the data scientist goes into MLFlow, looks at those artifacts, judges the quality, has a bit of back and forth with his colleagues, and decides, “Okay, is this good enough for me to take this model into production?” If that is the case, what we then ask of our data scientist is: don’t merge that code yet. We only accept merge requests on our repo that change both the code for your model and the inference code in one go.
And this is also the reason why, per project, we have a sort of model repo. It sounds really heavy, but actually what that means is we have a repository where we store the code we use for training and the code we use for inference together. That allows us to have MRs where, if we want to merge something to master, we know that if our training code is updated, the inference code needs to be updated with it, and we can review that in one go. And this is really enabling for data scientists, because now data scientists can take that model to production basically themselves, without having to ask an engineer to update something somewhere else.
And so what happens? The data scientist creates a merge request on that repository and what he does is the following: he updates the inference code in a similar way he updated his training code, so he’s still in the lead, he knows how he created those features, he knows what needs to be changed on the inference code, and he updates a small configuration file. And what that configuration file does basically, is you see it at the bottom, in line 2, it specifies the deployed run ID. And that is just a run ID of an MLFlow run, that is now referenced as “okay, this is the model I want to deploy in this flow.” Well, then our CI pipeline kicks off, we have the usual stuff, unittests, linting, et cetera, what everybody does.
But then the interesting part starts. Because what we then do, is we don’t deploy a thing we use that MLFlow load model in basically the code that we deploy. No, rather, we use in our CI pipeline, we have a step, that basically fetches the model from the MLFlow run, and integrates that into a single deployment artifact that contains, one, the inference code and secondly, the model artifact that it needs to be deployed with. And that has two advantages. One, I have one artifact that can be deployed without any runtime dependencies, so that’s nice, I’ve basically solved my first point, which was not a big point here. But secondly, this also provides one artifact that we can reason about and that contains a version of the inference code that is compatible with my model and if I want to go back to another version, I just go back to another artifact, basically. And I don’t have that two sources of truth, and I also solved the problem of feature compatibility, by that MR flow, by that merge request flow we added.
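A minimal sketch of such an “integrate model” CI step, under several assumptions: the configuration file is JSON with a hypothetical key `deployed_run_id`, the model was logged under the artifact path `model`, and the bundle is a plain directory. `mlflow.artifacts.download_artifacts` is an existing MLFlow function; whether Schiphol’s pipeline uses it is not stated in the talk. The download function is injected so the sketch can run here with a fake in place of a real tracking server.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def mlflow_download(run_id: str, artifact_path: str, dst_path: str) -> str:
    """Download a run's artifacts; needs mlflow and tracking-server access."""
    import mlflow.artifacts  # lazy import, only needed in the real CI job

    return mlflow.artifacts.download_artifacts(
        run_id=run_id, artifact_path=artifact_path, dst_path=dst_path
    )


def integrate_model(config_file: Path, inference_src: Path, bundle_dir: Path,
                    download: Callable[[str, str, str], str] = mlflow_download
                    ) -> Path:
    """Build one deployment artifact holding inference code plus model."""
    config = json.loads(config_file.read_text())
    run_id = config["deployed_run_id"]  # hypothetical key name

    bundle_dir.mkdir(parents=True, exist_ok=True)
    # Copy the inference code that was reviewed in the same merge request.
    shutil.copytree(inference_src, bundle_dir / "inference", dirs_exist_ok=True)
    # Pull the trained model out of the MLflow run at build time, so the
    # deployed job has no runtime dependency on the tracking server.
    download(run_id, "model", str(bundle_dir))
    # Record which run was bundled, for traceability.
    (bundle_dir / "DEPLOYED_RUN_ID").write_text(run_id)
    return bundle_dir


def fake_download(run_id: str, artifact_path: str, dst_path: str) -> str:
    """Stand-in for MLflow so the sketch runs without a tracking server."""
    target = Path(dst_path) / artifact_path
    target.mkdir(parents=True, exist_ok=True)
    (target / "model.pkl").write_text(f"model from run {run_id}")
    return str(target)


workdir = Path(tempfile.mkdtemp())
(workdir / "src").mkdir()
(workdir / "src" / "predict.py").write_text("# inference code\n")
(workdir / "deploy.json").write_text(json.dumps({"deployed_run_id": "abc123"}))

bundle = integrate_model(workdir / "deploy.json", workdir / "src",
                         workdir / "bundle", download=fake_download)
bundled_files = sorted(
    p.relative_to(bundle).as_posix() for p in bundle.rglob("*") if p.is_file()
)
print(bundled_files)
```

Since the run ID lives in a file under version control, reverting a commit reverts the model and the inference code together, which is the atomicity the flow is after.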
So basically, what you see here at the bottom right is our linting steps, and then in the build phase you see something we do with config, that’s the first step, you can ignore that, but secondly you see integrate mo, or integrate model, it says. And that is basically where we fetch that model from the run and integrate it into our deployment artifact. And that deployment artifact can be a Docker container or, in this case, since we agreed that we’d be talking about batch jobs, it can be a Databricks job. We can schedule that, we publish data in one go, and the nice thing about this is that reasoning about models in production is now just the same as reasoning about traditional software, because I can use my typical environment management tools. We use GitLab, but any other Git provider has something similar, where we can keep track of “Okay, what’s running where?” And also, if I want to revert, I can just go back to a specific version of the code base. Our CI flow takes care of deploying the right deployment artifact, and that also means there’s no more risk of me reverting something that would break in production.
And also here, we can just follow the typical flow, so feature branches go to dev, then we go to acceptance, and to production for tags. So that’s basically our flow.
And then I think I’ll hand over to you, Sebas, because you might now ask, “Where’s the model registry?” We do still use the model registry, but I think you are the better one to tell something about that.

Sebastiaan Gras…: Yeah, thanks. So the watchful viewer maybe already saw it, a few slides back on the CI pipeline, underneath deploy, there is a step called register_mod, which stands for register_model. So, yeah we still use it, but as mentioned by Floris, we use Git as the single source of truth for whatever we’re doing.
So we update the model registry from our CI pipeline, where we have a number of stages that map onto the MLFlow model registry stages. For feature branch deployments, we register a new version of the model if it doesn’t exist yet. If we push something to master, the model in our MLFlow registry goes to staging, and if we go to production, so we add tags on GitLab, then it gets updated to production in our model registry. And this is really nice because it also gives us a nice overview through the user interface, where we can see which model is running where. But it’s also clickable, so we can go to the actual runs that the models came from and once again see how the results looked, so the artifacts that we store, like the plots. And it also links back to Git, so we know which version of the code was used for that stage.
The general idea is that we wanted everything to go back to that configuration file. So underneath the script you see a variable, run_id, which references the run ID that is stored in the configuration, and then we have a little script that is called within the register model step. And that moves that run ID through the different stages, registering a new model version where needed.
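The branch-to-stage mapping described here can be sketched in a few lines of Python. This is an illustration under assumptions, not the script shown on the slide: the function names, the `master` default, and the `model` artifact path are hypothetical. The `client` argument is expected to behave like `mlflow.tracking.MlflowClient`, whose `search_model_versions`, `create_model_version`, and `transition_model_version_stage` methods are used; note that registry stages are deprecated in recent MLflow releases.

```python
from typing import Optional


def target_stage(branch: str, tag: Optional[str] = None, main_branch: str = "master") -> str:
    """Map the Git ref that triggered the pipeline to a registry stage."""
    if tag:
        return "Production"  # tagged releases go to production
    if branch == main_branch:
        return "Staging"  # merges to master go to staging
    return "None"  # feature branches only get registered


def register_and_promote(client, model_name: str, run_id: str, stage: str,
                         artifact_path: str = "model") -> str:
    """Register the run's model if needed, then move it to the given stage."""
    versions = client.search_model_versions(f"name='{model_name}'")
    existing = [v for v in versions if v.run_id == run_id]
    if existing:
        version = existing[0].version
    else:
        created = client.create_model_version(
            name=model_name,
            source=f"runs:/{run_id}/{artifact_path}",
            run_id=run_id,
        )
        version = created.version

    if stage != "None":
        client.transition_model_version_stage(
            name=model_name, version=version, stage=stage
        )
    return version
```

In a CI job this might be called as `register_and_promote(MlflowClient(), "my_model", run_id, target_stage(branch, tag))`, with `run_id` read from the configuration file and the branch and tag taken from the CI environment.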
So that’s part of how we use the model registry. But it’s also interesting to note that, of course, retraining the models is something that generally takes quite a lot of time, especially at the scale that we’re doing it. So we also thought about, “Okay, how can we leverage that ML project setup, with that nice way of doing our runs, so that we can automate part of that?” And the way that we do that is that we use Airflow.
So we generally reuse a lot of the functionality that we use for our own manual experimentation, so the whole ‘add an MR’ flow and running the MLFlow training job, but we use Airflow DAGs to take away some of that burden. We rerun the models once a week for some of our projects, and basically we shift our training and test sets: the dates for those are passed as parameters to our entry point, making it nice and easy to schedule those.
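To make the "shifting" of training and test sets concrete, here is a small Python sketch that derives the date parameters for one scheduled run. The window lengths, the parameter names, and the function name are assumptions for illustration; the talk only says that the dates are passed as parameters to the entry point.

```python
from datetime import date, timedelta
from typing import Dict


def retraining_parameters(run_date: date, train_days: int = 365,
                          test_days: int = 7) -> Dict[str, str]:
    """Derive train/test windows that slide along with the schedule date.

    The test window is the most recent `test_days` before `run_date`;
    the training window ends where the test window starts.
    """
    test_end = run_date
    test_start = test_end - timedelta(days=test_days)
    train_end = test_start
    train_start = train_end - timedelta(days=train_days)
    return {
        "train_start": train_start.isoformat(),
        "train_end": train_end.isoformat(),
        "test_start": test_start.isoformat(),
        "test_end": test_end.isoformat(),
    }
```

A weekly schedule would call this with its own execution date, so each run trains and evaluates on a window that is one week further along than the previous one.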
Really important to note: we don’t automatically update our models in production based on those retraining results. We just take away that manual process of starting a run and all the work around it, but the decision to go live is still up to the data scientist, to make sure that we don’t have situations where our models are suddenly failing or doing weird things. So we still have some eyes on the models.
Maybe next, a little extension on the automated retraining as well. We don’t just retrain the models, but we also added a step, which you can see at the top right, called “test production model”. That compares the model we’re retraining with the one we have in our production setting, to see if it’s actually necessary to retrain the model and put that new model into production. What that step does is, based on the run_id, it pulls the production model from MLFlow, and then, using a separate entry point, it tests that model against the new test set, which, since it’s a time series, is going to be further into the future, making sure that we don’t have feature leakage in that sense. And once again, it allows us to store those results and makes it easy to compare the metrics of the model in production with the model that we just retrained, over the same test period.
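The comparison itself can be as simple as the following Python sketch. The metric name, the 5% threshold, and the function name are assumptions for this example. In line with the talk, the result is meant as a hint for the data scientist, not as an automatic deployment trigger.

```python
from typing import Dict


def retrained_model_is_better(
    production_metrics: Dict[str, float],
    retrained_metrics: Dict[str, float],
    metric: str = "mae",
    min_relative_improvement: float = 0.05,
) -> bool:
    """Compare two models evaluated over the same test period.

    Assumes a lower-is-better error metric such as MAE.
    """
    prod = production_metrics[metric]
    new = retrained_metrics[metric]
    if prod <= 0:
        return False  # production error is already zero, nothing to gain
    improvement = (prod - new) / prod
    return improvement >= min_relative_improvement
```

For example, with a production MAE of 10.0 and a retrained MAE of 9.0 over the same period, the relative improvement is 10%, which clears the assumed 5% threshold.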
A small example of how that looks is at the bottom. Since there wasn’t really an MLFlow operator yet that did what we needed it to do, we created our own. It’s here, it’s called the “MLFlow Run Operator”, and we made it in a way that we could also share it nicely between our different teams, so that they could very easily use it as well. The operator itself is quite easy to use. For example, it takes parameters such as the entry point, which references back to that ML project entry point. You have a Git repo that you can reference, the version that you want to use, and the parameters that you want to use for the entry point, which you see here added as a dictionary. An example given here is the tag, which is the tag for the run, as well as your experiment ID, so where the runs that you do should actually be stored, and the Databricks job connection that you need to do the actual run itself. So that allows us to quite nicely and easily rerun our models and retrain them, and takes away some of the manual work.
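The operator from the slide is not reproduced in this transcript, so the following is only a hypothetical Python sketch of what such a wrapper could look like. It mirrors the parameters mentioned above and forwards them to `mlflow.projects.run`, which is a real MLflow function. The class name, field names, and the injectable `runner` are assumptions, and the class deliberately does not subclass Airflow’s `BaseOperator` so that it runs without Airflow installed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


def _run_with_mlflow(**kwargs: Any) -> Any:
    # Imported lazily so the sketch can be exercised without MLflow installed.
    import mlflow.projects

    return mlflow.projects.run(**kwargs)


@dataclass
class MLflowRunTask:
    """Hypothetical stand-in for an Airflow operator that starts an MLflow run."""

    uri: str  # Git repository holding the MLproject file
    entry_point: str  # entry point defined in that MLproject file
    experiment_id: str  # experiment in which the run is stored
    parameters: Dict[str, Any]  # parameters for the entry point
    version: Optional[str] = None  # Git commit or branch to run
    backend: str = "databricks"
    backend_config: Optional[Dict[str, Any]] = None  # e.g. a cluster spec
    runner: Callable[..., Any] = _run_with_mlflow

    def execute(self, context: Optional[Dict[str, Any]] = None) -> Any:
        # In a real Airflow operator, `context` carries the execution date,
        # which could be used to fill in date parameters.
        return self.runner(
            uri=self.uri,
            entry_point=self.entry_point,
            version=self.version,
            parameters=self.parameters,
            experiment_id=self.experiment_id,
            backend=self.backend,
            backend_config=self.backend_config,
        )
```

Keeping the call to MLflow behind a small callable makes the task easy to test in isolation and keeps the DAG definition limited to configuration.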
And then, lastly, the monitoring of it, because, hey, we have models running in production, that’s fine and dandy, but how can we actually make sure that they’re still doing what we expect them to do, and that they don’t fail in a way that we don’t really see or, for example, start drifting in performance? So what we do is, we take the metrics that come out of our prediction jobs and log those to DataDog, and on DataDog we have a dashboard set up where we can just view the stored metrics over time for various projects. But more importantly, we also have monitoring set up. So if, for example, data sources stop functioning, so the feed is getting old, then a warning gets pushed to a Slack channel, or if the model performance is way worse than what we expect it to be, then an alarm goes off and this also gets sent to that Slack channel, allowing the data scientists or engineers to quite easily take a look at what’s going on and to be warned when it’s going wrong. And then, if we want to actually see what’s going wrong, we can use logbooks or Databricks notebooks to dive into those anomalies.
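The two kinds of checks mentioned here, a stale feed and a performance drop, can be illustrated with a small Python sketch. In the talk these rules live in DataDog monitors, so this is only a stand-in for the logic: the thresholds, message texts, and function name are assumptions.

```python
from datetime import datetime, timedelta
from typing import List


def collect_alerts(
    last_feed_update: datetime,
    now: datetime,
    observed_error: float,
    expected_error: float,
    max_feed_age: timedelta = timedelta(hours=1),
    error_tolerance: float = 1.5,
) -> List[str]:
    """Return the messages that would be sent to the alerting channel."""
    alerts: List[str] = []
    # Stale input data: the feed has not been refreshed recently enough.
    if now - last_feed_update > max_feed_age:
        alerts.append("WARNING: data feed is stale")
    # Performance drop: the error is far above what we normally see.
    if observed_error > expected_error * error_tolerance:
        alerts.append("ALARM: model error above expected range")
    return alerts
```

Keeping the staleness check separate from the performance check helps to tell apart "the input is broken" from "the model no longer fits", which usually call for different follow-up actions.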
This entire workflow brings quite a number of benefits for us data scientists. Like Floris mentioned a number of times, it allows us, as data scientists, to deploy models without any support, so we don’t have to go to engineers whenever something goes beyond testing the model itself. And that’s really important, since we release new versions of many of our models quite often. And this also goes beyond just training on new data: we can also quite easily extend our model by adding features or changing data sources, and the data scientist can still make it work and also make sure it’ll work in production, at inference time, which is quite nice.
The workflow that we have with GitLab, pulling everything from GitLab into MLFlow and Databricks, also allows us to have it fully versioned with a single source of truth. That is important, because it tells us that if something works on the development environment, it’ll also work on acceptance and production, because we have that single deployment package. So we no longer really have those cross-environment dependencies, because our code, our model and our configuration are packaged together in the same package. And that allows us to easily revert if something breaks. So rather than just rolling back a model, through those GitLab tags we roll back the code, the model, and the configuration, which are nicely coupled in that sense.
So our key takeaways: we think MLFlow is really a great tool, but it’s not always a click-and-go solution. The fact that you have a standard way of storing and running your models along with any metrics and results surrounding it, like the artifacts, is really nice, and really helps in cooperation between the teams, and within the team, as well. Feature compatibility is an important issue to keep in mind, because your model is a lot more than just your algorithm, especially in changing environments, such as we have at Schiphol. If physical aspects of the airport change, then we need to make sure that our models and the features that the models use change with them. But having a single source of truth for that will make managing those models a lot easier and a lot more like managing traditional software. So all in all, this entire workflow, or having a proper MLFlow or MLOps Workflow enables a lot of speed into getting these machine learning models to production, because it’s just a nice standard way of working. Also, if we start up a new project, there’s quite a lot of infrastructure that we are able to reuse, allowing us to keep nice speed on that.
So hopefully we gave you guys a good overview of what we’re doing and how we’re leveraging Databricks MLFlow in our MLOps situation. Are there any questions?

Floris Hoogenboom

Floris Hoogenboom is the Lead Data Scientist at Royal Schiphol Group and oversees all machine learning & AI development that happens at the Airport in house. Together with four teams of data scientist...

Sebastiaan Grasdijk

Sebastiaan works with Schiphol as a Senior Data Scientist. He is responsible for the implementation of the so called Airport Operational Plan which brings together many of the AI application in Airpor...