Catch Me If You Can: Keeping Up With ML Models in Production

May 26, 2021 11:30 AM (PT)


Advances in machine learning and big data are disrupting every industry. However, even when companies deploy to production, they face significant challenges in staying there: models deployed over an extended period of time often see performance degrade significantly from offline benchmarks, a phenomenon known as performance drift, typically caused by changing data distributions.

In this talk, we discuss approaches to mitigate the effects of performance drift, illustrating our methods on a sample prediction task. Leveraging our experience at a startup deploying and monitoring production-grade ML pipelines for predictive maintenance, we also address several aspects of machine learning often overlooked in academia, such as the incorporation of non-technical collaborators and the integration of machine learning in an agile framework. This talk will consist of:

  • Using Python, Dask, and open-source datasets to demonstrate an example of training and validating a model in an offline setting that subsequently experiences performance degradation when it is deployed.
  • Using MLFlow, Prometheus, and Grafana to show how one can build tools to monitor production pipelines and enable teams of different stakeholders to quickly identify performance degradation using the right metrics.
  • Proposing a checklist of criteria for when to retrain machine learning models in production.

This talk will be a slideshow presentation accompanied by a Python notebook demo. It is aimed towards engineers that deploy and debug models in production, but may be of broader interest for people building machine learning-based products, and requires a familiarity with machine learning basics (train/test sets, decision trees).

In this session watch:
Shreya Shankar, Developer

 

Transcript

Shreya Shankar: Hi, everyone. I’m Shreya. And today, I’ll be talking about machine learning models in production. The talk is titled Catch Me If You Can: Keeping up with ML in Production. I’ll talk a little bit about myself, my background. I’ll dive into a case study of deploying a simple machine learning model, talk about some of the challenges around that, demoing some tools that I’ve kind of built for monitoring and tracing ML outputs and ML pipelines. And finally, I’ll wrap it up with a summary of what we’ve learned and kind of how you can move forward from there and some open areas of work in the field.
A little bit about myself: I got my undergrad and master’s from Stanford CS. My undergrad focus was in computer systems and my master’s was in machine learning, or AI. And after college, I joined a startup called Viaduct as the first ML engineer. And at Viaduct, I did a lot of things, as you do at an early-stage startup. I worked with terabytes of time series data. I built infrastructure for large-scale ML and data analytics at the company. And my responsibilities were kind of all over the place, spanning recruiting, engineering, ML, product, and more.
And this was very interesting for me because I had come out of a very academic experience in machine learning prior to joining the startup. And a lot of academic efforts are focused on training techniques in ML. There are a lot of interesting areas of work in fairness, in generalizability to test sets with unknown distributions, and in ML security, like adversarial examples.
But industry ML efforts are a bit different, because there’s a lot more collaboration, I think. Industry places emphasis on an Agile working style. And additionally, in industry, we want to train fewer models than we do in academia, but we want to do lots of inference with those models. How do we leverage these models as much as possible? Which raises a question that we never really asked in academia: what happens beyond the validation or test set? And that’s what I would like to explore today in this talk.
Unfortunately, we’re at a depressing state, I think, for ML in real life and industry. A good number of data science projects still never make it to production, and this could be for many reasons. The obvious one being that data in industry is not necessarily clean and labeled like the canonical benchmark datasets, like ImageNet. Data in the “real world” is always changing, especially if you’re working with time series data.
And the bottom line here is that showing high performance on a fixed train and validation set does not necessarily mean you’ll have consistently high performance when that model is deployed. And if you’ve ever done ML in industry, I’m sure that you’ve experienced this. And if not, hopefully this talk will convince you that it can happen.
So, today’s talk is not about how to improve model performance or how to do data science or feature engineering. Instead, I’ll be talking about what happens after you have a model and you deploy it, and showcasing tools to monitor these production pipelines. How do the outputs of ML pipelines change over time? And how do we use these tools to enable people in different roles, on Agile teams for example, to quickly identify performance degradation in an ML model? Hopefully this motivates the question of when you should retrain an ML model in production.
So, here’s a general problem statement. For many people who’ve worked in industry ML, particularly with time series data, it’s very common to experience performance drift, which I’ll define as the performance of your ML pipeline degrading or changing over time. And this is unavoidable in a lot of cases, unfortunately.
For example, when you have large amounts of time series data, like I worked with at my previous company, or when you have a lot of high-variance data; both of these come up a lot in recommender systems and other applications. And intuitively, a model and a pipeline should not be stale. What does that mean? It means that we want the pipeline to reflect the most recent data points.
So, in today’s demo, we’ll be simulating performance drift with a toy task, which is essentially predicting whether a taxi rider will give their driver a high tip or not. This is a binary classification problem, arbitrarily defined, but it uses time series data. So hopefully, you’ll see the performance [inaudible].
A little bit more about this case study: I use the publicly available New York taxi cab trip data that’s in an S3 bucket. And for this exercise, we train and simulate deployment of a model to predict whether the rider gave their driver a large tip. The tools that we use today are Python; Prometheus for doing the monitoring; Grafana for visualizing the metrics that are being monitored; MLFlow, which I’ll talk a little bit about; and then mltrace, which is a tool that I’m super excited to show you, that I built pretty recently, for tracing lineage in ML pipelines.
And the goal of this exercise is to demonstrate what can go wrong post-deployment and how to troubleshoot bugs. This is hard for one big reason: diagnosing failure points in a pipeline can be really challenging, especially for Agile teams or teams with multiple stakeholders, when these pipelines are really complex. And I’m sure if you work on a team with data scientists, engineers, and PMs all collaborating on the same applied ML task, you have totally experienced this before.
A little bit more about the data set used for this exercise. It’s a tabular data set where every record corresponds to a taxi ride. And there are several columns in these tables: pickup and dropoff times, number of passengers, trip distance, the pickup and dropoff location zones, and the fare, toll, and tip amounts of the ride. It’s stored as monthly CSV files from January 2009 to June 2020; I don’t know, maybe they’ve added more in the last few weeks. But every month is about a gigabyte of data, so you can imagine that this is pretty massive in itself. And there are at least a million records per month.
A little bit more about the ML task specifically. There’s a lot of notation here, but the bottom line is just that we have features and we have labels. This is a binary classification task, as you can see: the labels are either zero or one, and a positive label corresponds to the rider tipping more than 20% of the total fare. I picked 20% super arbitrarily, just for a toy example here. And we want to learn a classifier that hopefully doesn’t overfit.
And we just use a RandomForestClassifier with super basic hyperparameters. Maybe max depth is a little more than it needs to be, but I just pulled these off some Medium post, so hopefully you’ll go along with it. And the code all lives in a public GitHub repository that I’ve created called toy-ml-pipeline, an example of an end-to-end pipeline for this ML task.
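To make that concrete, here is a minimal sketch of the label definition and model setup just described. The column names and the exact hyperparameter values are assumptions for illustration; the real code lives in the toy-ml-pipeline repo.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def make_label(df: pd.DataFrame) -> pd.Series:
    # Positive label: the rider tipped more than 20% of the total fare.
    # Column names follow the public NYC taxi schema but are assumptions here.
    return (df["tip_amount"] > 0.2 * df["fare_amount"]).astype(int)

# Super basic hyperparameters, in the spirit of "pulled off some Medium post."
model = RandomForestClassifier(max_depth=10, n_estimators=100)
```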
And I’ll describe it a little bit here with this diagram. Essentially, we have five stages. The cleaning stage, the first stage, literally reads the raw data from the public S3 buckets and transforms the data into a clean version. The feature generation stage computes a few features on top of the data and saves them somewhere. The split stage splits these features into train and test sets depending on the time windows that a data scientist might want. The training stage trains a model on the train set and evaluates it on the test set. And finally, the inference stage is essentially a wrapper around model.predict.
And this last part of the pipeline is two files. There’s one Flask app that has an endpoint that is a wrapper around the model.predict function. And then there’s another file that repeatedly sends requests of features to that Flask app. This kind of simulates, “Okay, maybe this is an applied ML API,” and it simulates sending a request for a certain prediction and getting a response with that prediction. So, with that, I will talk a little bit about each of these stages, just to get everyone up to speed.
First things first, some general utilities. We have an S3 bucket, but it’s not the New York taxi cab S3 bucket; the S3 bucket is on the README of this repo, I think it’s just called toy-ml-pipeline as is. And that’s where we have all the intermediate storage for all of these stages. So essentially, after each one of these stages, we write to this S3 bucket that I’ve created, and then the next stage can read from that same S3 bucket.
I also built a very small io library, which essentially dumps versioned output of each stage every time that stage is run. And then of course, once you have dumped the versioned output, you can load the latest version of the outputs to use as inputs to another stage. And then we serve predictions via a Flask application, where the code is pretty straightforward: it loads a model, and this utilizes the io library to load the latest version of the model. And we have this predict route, which essentially takes the request, reads in the data and the features, makes a prediction, and then sends back the result.
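Here is a rough sketch of what such a predict route might look like. The joblib call stands in for the io library’s “load the latest version” helper, and the file and route names are assumptions for illustration.

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
# Stand-in for the io helper that loads the latest versioned model.
model = joblib.load("model_latest.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Read the features out of the request and run the model on them.
    features = pd.DataFrame(request.get_json())
    preds = model.predict_proba(features)[:, 1]  # probability of a high tip
    return jsonify({"predictions": preds.tolist()})
```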
Cool. I hope people are on the same page. Unfortunately, since this is a recording, I won’t know, but we will move forward. So, the first stage of the toy ML pipeline, which utilizes these utilities, is pretty much the ETL, which is the cleaning and the feature generation. A little bit more about the cleaning: it reads in the raw data, as I mentioned before, for every month, where basically every month is a different CSV file. And it’s very basic; it just removes rows with timestamps that don’t fall within the month range, and then it also removes rows with zero-dollar fares. So maybe that’s somebody who requested a taxi ride but didn’t actually follow through.
The feature generation stage is run directly after the cleaning stage, and it computes 11 features or so. There are trip-based features, so for example, how many passengers were in the trip, the trip distance, trip time, and trip speed. There are pickup-based features, so things like what the weekday was, what the hour was, what the minute was, whether it was during working hours, eight to five or something on weekdays, I don’t know.
And then there are basic categorical features, such as the pickup location code, the dropoff location code, and the type of ride. Cool. So, after ETL, which is essentially a bunch of data engineering logic, we have an offline training and evaluation stage, where essentially what we want to do is train on January 2020 and validate on February 2020. First, I picked these months because they just seemed easy, the first few months of 2020. But also, we get a really nice canonical example of what happens if you deploy in March 2020, which coincidentally is when the pandemic started, and the data definitely shifts as a result of that.
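Before getting into training and evaluation, here is a minimal pandas sketch of the kind of cleaning and feature logic the ETL stages above might contain. Column names follow the public NYC yellow taxi schema, and the exact feature set is an approximation of what the talk describes.

```python
import pandas as pd

def clean(df: pd.DataFrame, month_start: str, month_end: str) -> pd.DataFrame:
    # Keep rides whose pickup falls inside the month and drop zero-dollar fares.
    in_month = df["tpep_pickup_datetime"].between(month_start, month_end)
    return df[in_month & (df["fare_amount"] > 0)]

def featurize(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame()
    # Trip-based features.
    out["passenger_count"] = df["passenger_count"]
    out["trip_distance"] = df["trip_distance"]
    out["trip_minutes"] = (
        df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    ).dt.total_seconds() / 60
    out["trip_speed"] = df["trip_distance"] / out["trip_minutes"].clip(lower=1)
    # Pickup-based features.
    out["pickup_weekday"] = df["tpep_pickup_datetime"].dt.weekday
    out["pickup_hour"] = df["tpep_pickup_datetime"].dt.hour
    out["work_hours"] = (
        (out["pickup_weekday"] < 5) & out["pickup_hour"].between(8, 17)
    ).astype(int)
    # Categorical features.
    out["pickup_zone"] = df["PULocationID"]
    out["dropoff_zone"] = df["DOLocationID"]
    return out
```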
For our metric, we measure the F1 score. The metric is how we evaluate how good our model is, and the F1 score is a combination of precision and recall. At a very high level, a higher F1 score is better: we want low false positives and low false negatives.
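For reference: F1 = 2 * precision * recall / (precision + recall), the harmonic mean of the two, so it is only high when both precision and recall are high.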
So, what are the steps here for offline training and evaluation? As I mentioned in the diagram before, we want to split our features into a training test set. And then we want to follow up with some sort of model training and evaluation.
So, enumerating them as follows: One, we want to split the output of the featuregen stage into two different files, flat files. And then, we want to train and validate the model in a notebook. I just used a notebook here, because in my experience, data scientists have trained models in notebooks before. But of course, you can train it in, I don’t know, a regular Python file or wherever you want.
And once we train and validate the model in a notebook, we can take those parameters, or take that training technique, and train the model for production in another notebook or the same notebook. So that’s what we do here. The production model will use the same parameters as the previous model, but we’ll train it on February 2020 so that we can deploy in March. Cool.
So here is a big code dump of some of the train/test split code. Essentially, it’s pretty straightforward: define our train and test months, load the features for each of these months, and then save them as our train flat file and our test flat file. As you can see here, my printout shows where I saved them. Here we have our flat files for the train and test sets, and we can use them in the following stage, the stage after split, to actually train a model.
So, we specify our features and our label. As for our model parameters: in hindsight, I probably should have added a random seed, but this is where we are. We create a random forest model; this is just a wrapper I have around the random forest that defines some extra helper functions, as we’ll see soon. We train the model on the train data frame with the specific label column, and then we evaluate it. The score function here is basically a wrapper on top of computing the F1. We score it on the train set and the test set, and the wrapper also allows us to store dictionaries of the metrics.
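Stripped of the wrapper classes, the split/train/evaluate steps described above look roughly like this sketch. The file names, file format, and feature list are assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

FEATURES = ["passenger_count", "trip_distance", "trip_minutes", "trip_speed",
            "pickup_weekday", "pickup_hour", "work_hours",
            "pickup_zone", "dropoff_zone"]
LABEL = "high_tip"

# Split: one flat file per month (hypothetical paths).
train_df = pd.read_parquet("features_2020_01.parquet")  # January 2020
test_df = pd.read_parquet("features_2020_02.parquet")   # February 2020

# Train and evaluate.
model = RandomForestClassifier(max_depth=10, n_estimators=100)
model.fit(train_df[FEATURES], train_df[LABEL])
metrics = {
    "train_f1": f1_score(train_df[LABEL], model.predict(train_df[FEATURES])),
    "test_f1": f1_score(test_df[LABEL], model.predict(test_df[FEATURES])),
}
print(metrics)  # in the talk, both came out around 0.73
```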
So, we store the train and test F1 scores, as you can see here. And they look pretty similar, which is nice. I mean, they’re not stellar, but they’re not 50%. So, let’s move forward for the sake of this talk. The whole point here is that we have gone through all of these steps, and that’s pretty much the bare minimum of steps you might have in an applied ML pipeline. People will have way more data engineering, way more cleaning criteria, more complex feature generation logic. Maybe they’ll even have multiple models that are chained together.
Right now, as you can see, we just have one model. But here we have the bare minimum of steps that you need to get to training a model, and that itself can be quite complicated. So hopefully, this motivates the need for production monitoring and tracing tools after we deploy, just to know, honestly, where a data point has been before the output is produced.
But, yes, we have our output here from the training stage. And then, we can train our production model before we promote it to production. Since we’re simulating deployment in March 2020, we want to train on February 2020. In this talk, we’ve kind of glossed over hyperparameter tuning and checking robustness; in practice, you’d want to train models on different windows or do more thorough data science. But here, we just have an hour-long talk, so I’m charging forward.
So, we train a model on February 2020, with pretty much the same parameters that we explored before: loading the features from February 2020, training the random forest on top of them with the same model parameters that we had before. And the train score is essentially the same as the scores we saw in the previous offline stage. Then we save our model; here again, I used my io utility for saving versioned outputs. So here we have it: we trained the production model, and we’re ready to “deploy to production.”
And for the deployment here, I’m going to show a demo. But before I get to that, let me describe at a high level how the code is structured. We have two files, as I mentioned before. One is a Flask server that runs predict on the features passed in via request, which I showed you a snippet of in a previous slide. And then we also have another Python file that just repeatedly sends requests to the server to run inference on.
And for the monitoring demo here, I essentially log metrics in the Flask app via Prometheus and then visualize them with Grafana. What I’m going to do here is start up the Flask server, run the inference.py file, which every second or so sends in requests to get predictions, and then maybe in 10 minutes or so, we’ll be able to look back at the Grafana dashboard and see the metrics that are being logged live in real time and see how they change.
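For reference, the request loop in a file like inference.py might look something like this sketch; the URL, file name, and response format are assumptions, matching the Flask sketch earlier.

```python
import time

import pandas as pd
import requests

features = pd.read_parquet("features_2020_03.parquet")  # March 2020 features

# Every second or so, send a batch of 256 feature rows to the /predict endpoint.
for start in range(0, len(features), 256):
    batch = features.iloc[start:start + 256]
    resp = requests.post(
        "http://localhost:5000/predict",
        json=batch.to_dict(orient="records"),
    )
    print(resp.json()["predictions"][:5])  # peek at a few predictions
    time.sleep(1)
```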
So, let me jump to that. Cool. So here, I am running the inference file, which might take a little bit of time to start up. But essentially, I’ll send in batches of 256 examples to a server that I have started; basically, I just ran docker-compose up here to get it to work. All right. Cool. It is sending examples to the server. Excellent. And we want to start our visualization. Oh, man, I never remember these. Perfect.
And then, once we’re in Grafana, we’ll have to add our data source, which is pretty easy. I’m using Docker Compose, and I named my container example-prometheus9090. Perfect. And then, I also have a nice dashboard to import; that’ll be via grafana.com. There, upload the JSON. Then we’ll just call it ML Task Monitoring. Cool.
And here we have our dashboard. As you can see, since I just started sending requests, we don’t have that much data to see. So, we can revisit this in a few minutes, once I get past a few slides. But essentially, here I have two types of monitoring. One is the generic Flask monitoring: requests coming in and the timing for those. And then, I have ML-specific monitoring around the outputs of the model here, like the average predictive value.
So just as a sanity check, everything should be between zero and one because our prediction is going to be some probability of a high tip. So, we can kind of see the outputs as they change over time. I’ll talk a little bit more about these, the distribution and the predictive value percentiles, when we actually have enough data to analyze. So cool, that’s been started. So, now that the demo is running, or the server is up and running, and it’s getting some requests, let’s wait for enough requests to come in before we analyze anything.
So, in the meantime, I’ll talk a little bit about the challenges that we’re going to experience post-deployment. One being, well, we have limited time in the session, so maybe I’ll pretend to fast-forward to the end of March 2020 and figure out: all right, in a perfect world, we have our labels. If we have the labels, can we measure our performance in real time? It’s not a perfect world most of the time, but here, since this is historical data, we can actually do that.
And so, we compute the F1 score over time in March. There are two ways I compute it. One is from March 1 to March whatever, so March 3rd, or 28th, or 29th: what was the F1 score over that period of time? I called that the rolling F1 score. And then, I also measured the daily F1 score here, which is essentially: if you just take the records that occurred on this specific day, what was the F1 score? So obviously, for the first day, the two should be the same, from the beginning to the end of March 1, and then after that, they can diverge.
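As a sketch, assuming a DataFrame of March predictions with a pickup date, the true label, and the model’s prediction (all hypothetical column names), the two metrics could be computed like this:

```python
import pandas as pd
from sklearn.metrics import f1_score

def rolling_and_daily_f1(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for day in sorted(df["pickup_date"].unique()):
        so_far = df[df["pickup_date"] <= day]  # March 1 through this day
        today = df[df["pickup_date"] == day]   # this day only
        rows.append({
            "day": day,
            "rolling_f1": f1_score(so_far["label"], so_far["prediction"]),
            "daily_f1": f1_score(today["label"], today["prediction"]),
        })
    return pd.DataFrame(rows)
```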
So here, if we inspect them a little further, we can see that there’s a pretty big discrepancy, at least for the last several days, between the rolling F1 score and the daily F1 score. And since we know about COVID, we could already expect there to be some performance degradation or performance drift. But the rolling F1 score doesn’t quite tell us how significant the drift is. Keep in mind that the train and validation F1 scores were around 73%, and here we are at 66%.
So, there’s definitely a drop, but it’s not as bad as if we were to measure the daily F1 scores, at least specifically towards the end. So, the bottom line here is that even when monitoring, even if you have access to your labels, what you monitor is quite important, right? The daily metric significantly drops towards the end, and you can kind of see that effect of COVID on these taxi rides. So presumably, people were not tipping as much, or their behavior changed as a result of COVID.
And this motivates another question, right? Which is: if you don’t believe me that the model’s performance has degraded, can you evaluate that same model on the following months, April, May, and June? So, I did that, and the performance does get significantly worse. I don’t think there’s any way around it: you have to retrain your models periodically to reflect the latest data that you have.
So, one challenge here that I briefly touched on is that a lot of the time, you don’t really have access to the labels or the most recent data at any given point, because there’s some form of lag. In this example, we’re actually working with historical data, so we have all of the data for March 2020. But in practice, you can experience a lot of different types of lag, one being feature lag, which is when our system only learns about the ride well after it’s occurred.
So maybe somebody took a taxi ride, and the taxi was in the middle of nowhere, and it took several days before the driver got back online and was able to send their information to the New York taxi organization, or whoever. In that case, a significant amount of time has passed before we even learn about the data as the people doing the analysis.
Another type of lag is, even if you get all of the features, maybe they come in at different points in time. Or maybe you have some form of label lag. So maybe we get the features: we get the number of passengers, and we get the pickup time and the dropoff time. But maybe we only learn about something like the fare or the tip well after we’ve gotten the other features, or well after the ride has occurred, which makes it really hard to evaluate our model in real time. The evaluation metric will inherently be lagging.
And it also poses other problems, right? One is that we’re not going to be able to train on the most recent data if we have lag going on. But there are still other things that we can monitor, as I’ve shown you in the Grafana dashboard: we’re not monitoring anything with respect to labels, we’re just monitoring our model’s outputs. And when we monitor our model’s outputs, that poses another challenge: we could experience distribution shifts. The data “drifts” or changes, the nature of the data can change over time, and the models will need to be retrained to reflect that kind of drift.
So, one big open question is: how often do you retrain the model? And retraining isn’t super straightforward or easy. It adds complexity to the system, because every time you train a model, you have more artifacts to version and keep track of. Every time you retrain, it can be expensive in terms of compute. It can also be expensive in terms of people’s time. And every time you make a pipeline more complex, fewer people are able to understand what’s going on, which is also a problem, especially for Agile teams. So, how do you know? One thing that we can say is, “Okay, well, maybe we’ll just retrain a model when the data has drifted.” And then we can instead answer the question of when the data has drifted.
So, let’s go back to our demo. Hopefully, there have been enough predictions. Great. So, going back to see our dashboard. Here’s a screenshot from a practice run I did yesterday, in case we couldn’t get it live, but it looks like we actually are able to see it, which is amazing. So, you can kind of see the outputs and their distributions over time. One thing that’s interesting is that our average predictive value has dropped quite a bit. And the average may not be the best thing to monitor; I have a lot of other things going on.
So, let’s start by describing what’s in the panels. We have the predictive value percentiles, which are essentially a few fun percentiles I picked, ranging from one to 99. And the median is this guy right here. You can kind of watch the median over time, for example, and it stays relatively stable, which is interesting. It’s very interesting that the first percentile changes a lot over here.
We have the 99th percentile also changing, not as much as the median, but changing. So, it looks like the bottom percentiles here are kind of dropping over time. If we want to look at the distribution of predictions, consider this like a probability function at any given point in time; the values here should sum up to one. So, the most likely prediction at this given point in time is something around 60%.
And here, we can see that the blue one, the 65, goes down, and the 60 goes up. So, it’s interesting to monitor, right? Keep in mind my batch size was 256, so I’m sending around 256 examples per second. You can see that there is some form of drift. But the question becomes, “Is this normal? Does this require retraining the model?” I think there are a lot of interesting questions to ask there. Maybe this could be very normal, since I’m running this in chronological order; maybe early morning rides have higher tips, and then late morning rides have lower tips, or something like that. We don’t know.
So, one solution is to have an expert on this ML task sit here and monitor these dashboards forever, right? They will develop the intuition to figure out when to retrain the model. But that’s quite expensive, and it doesn’t scale well. If you want to add another data scientist, that knowledge transfer is annoying. It definitely doesn’t work on Agile teams, where you want to empower all the people on the team.
So, this dashboard doesn’t quite give us exactly the answer of when to retrain a model, and maybe we can talk a little bit about that as I get back to my slides. Okay, so maybe, instead of visually inspecting our outputs (or the features; you can imagine monitoring those too), we could use some statistical tests to figure out when the data has drifted. So, in this example, I applied methods from the Failing Loudly paper, which essentially takes two feature distributions, what you’ve trained on and what’s happening live in real time, does dimensionality reduction, compares them via a two-sample test, and then looks at the P values to see whether the difference is statistically significant or not.
So, what we’ll do here is apply that method pretty directly, except we’ll skip the dimensionality reduction, since we only have 11 features. I think they really did dimensionality reduction in the paper because they were applying this specifically to image data; even if you’re working with a 32 by 32 image, that’s already a lot of features.
So, in the paper, they noticed that multiple univariate testing performs pretty much the same as multivariate testing, which just means that we run a statistical test for each feature rather than trying to run one statistical test on a group of features. There are some caveats; maybe this is specific to their experiments on the image datasets, MNIST and CIFAR. But we’ll just apply what they have in the paper: we’ll employ multiple univariate testing, which is basically one test per feature, and see how that works.
So, for each feature, we run a two-sided Kolmogorov-Smirnov test, or KS test. I know it sounds super fancy, but a little bit about it: it is a nonparametric test for continuous data. It looks complicated, but let me dive into it a little bit. All this means is, imagine that you have two distributions P and Q. Imagine this is a distribution, like a PDF or something; if you integrate it, you get the CDF, or cumulative distribution function. And the test statistic is essentially the largest difference between the two CDFs.
So, what this means is, imagine I have two normal distributions. If you’ve seen a normal distribution before, you know it’s a bell-shaped curve; that’s the PDF. If you integrate that, you get something like an S-shaped curve. So now, we integrate both of the two normal distributions and we get two S-shaped curves. And the KS test statistic is the largest difference along the Y axis between the two CDFs.
So, if you look at this graph here, and imagine I’m running a vertical line through the graph, the largest difference between the two curves happens when X is around -0.5, maybe. Or maybe it’s zero; maybe it’s right here at -0.25. But this gap between the two is the KS test statistic. We pick the largest gap, not the smallest gap. And if this largest gap is very large, then we can conclude, “Okay, these distributions are pretty significantly different.” And if the two distributions are the same, you will have zero gap between the curves, and therefore the test statistic is zero.
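Concretely, if F_P and F_Q are the two empirical CDFs, the KS statistic is D = max over all x of |F_P(x) - F_Q(x)|, which is zero when the two distributions are identical and approaches one when they barely overlap.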
So, what we can do is, for each feature, plot the CDF and compare the train set and the live data. We’ll do this offline: we can compare the January 2020 feature distribution to the February 2020 distribution. And since we did that as an offline step and saw that the F1 scores were fairly similar, you shouldn’t expect the distributions to be too different. The way that we actually compare them is we get the test statistic, and then we can obtain a P value by looking at the test statistic in comparison to the number of samples in the data. Fortunately, we can do all of this using classic SciPy or something. I tried to implement it in Spark at my previous company and it was quite tedious, but it works.
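A minimal version of this per-feature comparison, using SciPy’s two-sample KS test (the DataFrame arguments are hypothetical), could look like this:

```python
import pandas as pd
from scipy.stats import ks_2samp

def ks_per_feature(reference: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    # Run a two-sample KS test per feature and collect statistics and P values.
    rows = []
    for col in reference.columns:
        stat, p_value = ks_2samp(reference[col], current[col])
        rows.append({"feature": col, "ks_statistic": stat, "p_value": p_value})
    return pd.DataFrame(rows)

# e.g. ks_per_feature(jan_2020_features, feb_2020_features)
```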
So, let’s dive into that and look at it. If we compare the January 2020 and February 2020 datasets using that test, the test statistic is in one column, and then there’s the P value, which puts the statistic in relation to the number of data points. Each feature has a different statistic, which makes sense. And looking at the statistic alone, I don’t really know how to interpret 0.046, to be completely honest. That’s where the P value comes in handy.
But the problem here that we see is, look at all of these P values and how incredibly low they are. One of them is 10 to the negative 258. And we were comparing two distributions here that we believed to be fairly similar because the January and February 2020 F1 scores were super similar.
So, the question is: why is this P value so small? This is not helpful to us, in a sense. And we can dive into that a little bit more, right? We get extremely low P values, and this method can have a very high “false positive” rate, flagging us, like, “Hey, the distributions are different,” even when they might not actually be meaningfully different.
And in my experience, especially working with large amounts of data, this method has flagged distributions as significantly different way more often than I’d want. If you’re monitoring something like this, you can easily get alert fatigue if you’re getting an alert every time just because you have a large dataset. And in the era of big data, where we literally have millions of data points in each of these monthly datasets, the P values are not that useful to look at, because if you have a lot of data, your P value will just be small.
So, I’ve gone through this whole exercise, and the question remains: how do you know when the data has drifted? I think that’s a very interesting area of statistical research, and research in general. But if we want to apply ML today, right now? You don’t know when the data has drifted; that’s the bottom line. If you want to be confident that your data has drifted, unfortunately, at this point in time, we just don’t have a reliable way to know. And so, we’re left with the pretty unsatisfying solution of retraining the model as often as our own infrastructure can handle, so that the model is as fresh as possible.
And I think there’s some very interesting stats that I was reading recently about Netflix, for example, deploying thousands of times a day. Meaning, they’re retraining thousands of times a day and redeploying thousands of times a day. And maybe things have drifted, maybe things have not. But the bottom line here is, a lot of these companies just want the most up to date model. And this brings to light a lot of very interesting engineering problems that we need to solve, which I’ll dive into a little bit.
So, let’s talk about the production ML bugs that can happen when you are retraining your model to be as fresh as possible all the time. And here are a few that I have personally experienced. There are many, many more. So, don’t get mad at me if I don’t include your favorite one. But there’s a lot of production ML bugs you could have.
One being that the data has changed or is corrupted. Maybe there are data dependency or infra issues, like somebody upstream or a connector had a problem, something that you or your company may not even be in control of. Maybe there are logical errors; this is very likely to happen in the ETL or even in retraining a model. Maybe you accidentally used the wrong dataset, or you accidentally specified March 2020 instead of February 2020, stuff like that.
Maybe you’ll have a logical error in promoting a new model to inference. So maybe you accidentally promote a stale model, or you accidentally forget to promote your most recently trained model. Maybe there’s a change in consumer behavior; for example, maybe COVID hit and then all of a sudden people are not taking taxis anymore. Who knows, right? The list can go on and on; you could do a whole talk on types of production ML bugs.
And why are they particularly hard to find and address in a way that software bugs aren’t? Software has bugs, right? But ML bugs are elusive and super annoying for many different reasons. One big one is that ML code fails very silently. Anybody who’s trained an ML model or a deep learning model knows this; deep learning is the worst culprit here. It’s very easy to get the model to compile and run, pass in your data, and finish the program, only to see that you have 12% accuracy or something ridiculous, with no runtime or compiler errors. Clearly, there’s some bug in the code, right? But there’s no easy way to find it, because there are a lot of silent failures.
A lot of the time in production pipelines, there is very little or no unit or integration testing in feature engineering and ML code, which makes it super easy for a lot of these errors to just slip by, especially when you’re doing feature engineering in SQL or Spark or something. Very few people, if any, have end-to-end visibility on an ML pipeline. I was fortunate to be the first ML engineer at my previous company, so I knew pretty much end-to-end what was going on in the pipeline, because I was writing all the steps.
But if you’re at a company in which you’re collaborating, or you’re on an Agile team, you’re responsible for one part of the pipeline, not all parts of the pipeline. And so, whenever something goes wrong, and there isn’t a person who just knows the thing end-to-end, it can be very hard to triage the issue. We also have pretty underdeveloped monitoring and tracing tools, I think, specifically for ML as a paradigm of software development. I think it just produces an insane amount of artifacts that are really hard to comb through.
And especially if you retrain your models, it makes things more complex, as I’ve talked about before. So, when you have all of these bugs, especially when you’re in Agile, it becomes important to monitor. And I mean, as I showed you in my demo, you want to monitor especially if you’re doing continuous delivery. This is especially hard when you don’t have labels and you’re trying to monitor for something real time.
And here are my thoughts on monitoring. As a very basic first pass, you want to monitor the model output distributions, as I showed in the Grafana dashboard. You can monitor the output distribution through the average model output, percentiles of the model output, histograms, and so forth. For more advanced monitoring, you can go towards monitoring the input or feature distributions. You can even monitor the ETL intermediate output distributions; so maybe monitor the clean data as well as the distribution of features. If the clean data distribution doesn’t change but a feature has changed, maybe a bug was added to the feature generation code, I don’t know. You can monitor pretty much anything you want, but it can get tedious very quickly if you have thousands of features, which is actually very common.
If you have a kitchen-sink data science approach, where data scientists just create all the aggregations they can, it can get very tedious to monitor all of these. And it’s still unclear, right? Even when you’re monitoring, as I showed you: how do you quantify drift? How do you detect drift? And how do you act on it?
So, having monitoring is important. We’ll talk a little bit about the Prometheus monitoring that I just did. There are a lot of solutions you could use for monitoring; there are a lot of startups working on ML-specific monitoring. I can’t speak to them particularly, because I’ve never worked at or really used them. But I can speak to rolling your own monitoring, as I’ve done here, with Prometheus and Grafana.
Some pros of Prometheus monitoring are that it’s pretty straightforward to just call observe. I’ll show you the Flask app code in a second that uses the Prometheus library, but it’s pretty straightforward to get a predicted value, val here, and then just call histogram.observe. It’s pretty easy to integrate with Grafana because there are Prometheus connectors for Grafana. And a lot of applications already use Prometheus for infra monitoring, so it’s nice to not have to switch to a new tool.
But there are also pain points for Prometheus monitoring. Maybe it’s easy to call histogram.observe, but for ML-specific metrics, the user needs to define the histogram buckets, which requires data scientists to collaborate with whoever is building these dashboards. It’s easy to integrate with Grafana, as I mentioned, but if you are trying to monitor five different quantiles or percentiles as well as a mean for every single feature and intermediate output, you can get a ton of clutter, and it’s hard to keep track of all of these metrics.
And even though many applications already use Prometheus (this is a personal opinion), PromQL, I think, is not very intuitive for ML-based tasks. Calling the rate function, for example, makes a lot of sense when you’re trying to monitor latencies and a lot of info around these kinds of APIs. But it’s much harder to frame ML monitoring using PromQL as a language; at least, it took me a little bit of time to wrap my head around it.
So, diving a little bit into what I actually used for the queries: I showed you three pretty big queries that I was running in the Grafana dashboard. To get the average output, as you can see, it’s not super intuitive. But imagine you’re using a histogram object to monitor the metric; you can essentially run this query, which is a sum over the rate of the count.
For the median output, you can use the histogram_quantile function in Prometheus, and of course, if you want the 99th percentile, you can change this to 0.99. Probability distributions are a little more non-intuitive, but essentially, you want to compute the rate of the bucket and change to the heatmap view instead of the timeseries view.
And if this doesn’t make too much sense to you, feel free to look at the dashboard.json that I have in my GitHub repo, or reach out to me; my contact info is here. But I also want to show you a little bit of the Prometheus logging code in the Flask app. So, let me switch to that really quickly. Here is the Flask app that I’m running; as you can see, here is the predict route and the predict function. I know there’s a lot of other clutter, but the only thing I’m pretty much doing for Prometheus in the code is creating a Prometheus metric object of histogram type and giving it a description.
And then, as I mentioned in the pros and cons, you do need to manually specify the buckets specific to your ML task. Otherwise, it’ll just use the generic Prometheus histogram buckets, which are designed specifically for API request times, which is not what a softmax probability, or really any probability, would look like.
So, here I use buckets from zero to one, spaced out by 0.05, or 5%. I manually specify this, and then it’s pretty straightforward to log for every prediction: I just call the histogram object’s observe on the float of the prediction. And then you can run the PromQL queries that I have in the slides, whether your dashboard is a monitoring solution that you’re paying for or you’re rolling your own Prometheus and Grafana.
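As a condensed sketch of that logging, using the prometheus_client library (the metric name is illustrative):

```python
from prometheus_client import Histogram

# ML-appropriate buckets: 0.0, 0.05, ..., 1.0, instead of the default
# request-latency buckets.
prediction_histogram = Histogram(
    "tip_prediction_value",
    "Predicted probability of a high tip",
    buckets=[round(0.05 * i, 2) for i in range(21)],
)

def record_predictions(predictions):
    # One observe() call per prediction so Prometheus tracks the distribution.
    for pred in predictions:
        prediction_histogram.observe(float(pred))
```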
Another thing that I think is super necessary, especially when you’re retraining regularly, is to have some form of tracing. Suppose you retrain your model on a weekly basis, or really any cadence. There are ways to streamline this, one being to use the MLFlow Model Registry to keep track of which model is the best model. What I did in my demo is just manage the versioning myself, but I think it’s much easier to use the MLFlow Model Registry, especially when you have multiple prediction tasks. I’ve used the MLFlow Model Registry before, so I can speak to it being helpful.
But for this toy pipeline, I just managed the versioning myself because there’s only one task. And when you have all of these models at inference time, essentially, you want to pull the latest and best model from the registry. So even if you’re retraining the model weekly and you’re continuously running inference over months at any given point in time, you just want to be able to pull the latest or best model after you’ve trained it.
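For reference, a hedged sketch of what using the MLFlow Model Registry for this could look like; the registered model name "taxi_high_tip" is an assumption for illustration.

```python
import mlflow
import mlflow.pyfunc
import mlflow.sklearn

# At training time: log the fitted sklearn model and register a new version.
with mlflow.start_run():
    mlflow.sklearn.log_model(model, "model", registered_model_name="taxi_high_tip")

# At inference time: pull whichever version is currently in the Production stage.
production_model = mlflow.pyfunc.load_model("models:/taxi_high_tip/Production")
```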
But when you have all of these artifacts, and especially when you’re working in a collaborative environment or on a team where you have an end user interacting with your outputs, it can become really, really hard to trace back. If the end user has a problem with some of the outputs, or there’s a week of predictions that you just seriously flopped on, how do you identify what produced that output, especially if it occurred in the past? And I think there aren’t that many tracing tools out there; honestly, maybe there are no tracing tools out there designed specifically with ML in mind. And that’s what I’ve been working on since I left my most recent company.
So, I built mltrace, and the first version is pretty much out on GitHub. So please check it out. And the role of mltrace is to do coarse-grained lineage and tracing for ML pipelines. And it’s designed specifically for complex data or ML pipelines in which you have multiple components of the pipeline. In our Toy ML pipeline, we only have five, but several companies have way more than five. So, it’s designed specifically for complex pipelines and also specifically for Agile or multi-disciplinary teams, where you have people owning different parts of the pipeline, like you have data engineers, ML engineers, software engineers, data scientists, and all of them are collaborating on the same tasks.
The current release, well, the alpha release as of May 3 (so maybe things have been added since then), contains a Python API to log run information and a UI to view traces for outputs, which I’m super excited to show you.
And let me talk a little bit about the design principles before I jump into how to use it. But here’s kind of how I approach designing it. I prioritized simplicity. I want the user to know everything that’s going on. It should be super straightforward to use and it shouldn’t do anything too smart so anyone can use it. And there’s no need for users to manually set component dependencies.
So, for example, if my feature generation stage reads from my cleaning stage, I shouldn’t have to manually set a dependency if it’s reading from the same file that the cleaning stage output. The tool should be able to detect the dependency by resolving I/O as it flows throughout the entire pipeline. The API is designed for both engineers and data scientists, and the UI is designed for pretty much anybody to help triage an issue, especially if they didn’t build any of the components themselves. So, even if they weren’t the data engineer writing the basic cleaning logic, they should still be able to go and figure out what files and what code were used in the data cleaning two weeks ago, and then run some basic EDA on those files to figure out if there are any glaring issues.
So, the UI is designed for people to help triage issues, even if they didn’t build the ETL or models themselves. This is particularly useful, I think, for collaborative or Agile teams. For example, when a PM receives a bug report, whose backlog do they put it on, right? Do they put it on the data scientist’s or the data engineer’s? They can’t trace the bug to where it plausibly could have happened. And the bottom line here is that we want to enable people who may not have developed the model to investigate the bug and work in a more collaborative way.
Some of the concepts specific to this library are as follows. Any data pipeline or transformation is made of several components; for example, for the Toy ML Pipeline described here, there’s the cleaning stage, the featuregen stage, the split stage, the train stage, and so forth. And the motivator here is that in ML pipelines or complex data pipelines, different artifacts are produced even when the same component is run more than once.
So, for example, every time I run the feature generation stage, the inputs to the feature generation stage are different and the outputs are different, based on the time that it was run, right? So how do we log and centralize this in a more effective way? That’s where mltrace comes in. The two abstractions that mltrace uses are Component and ComponentRun. A Component is something like cleaning or featuregen, and a ComponentRun is an instance of that component being run. The documentation describes this further.
And there are two ways to use this logging API. One is to use a decorator interface, which is super similar to Dagster “solids,” if you’re familiar with that. Essentially, the user specifies the component name and the input and output variables in the decorator, and the decorator wraps a tracer around the function and logs the values of those variables. The inputs and outputs are essentially whatever you define.
And if you want to do it yourself, there’s an alternative Pythonic interface similar to MLFlow tracking. Essentially, you create an instance of a ComponentRun object with the start time, the end time, and the inputs and outputs, and then you call log_component_run, which logs this component run and does all the dependency resolution behind the scenes. Integrating that into the code is pretty straightforward. Imagine you have a function called clean_data, which takes in your raw data file name and other things.
So, the input to this component is just the raw data file name, and the output of this component is the output path. The way I specify that to mltrace is I put this decorator right here: this is the cleaning component, with the raw data file name as the input variable and the output path as the output variable. I just put this decorator on top so that when the component is run, everything gets logged accordingly. The mltrace-specific things are: running the imports, creating a cleaning component with a description and an owner, and then optionally tagging it with a stage or something (this is for the UI, in case you’re interested in viewing components by tag). Then you can just run the clean_data function, which has the decorator on top of it, so all the logging is done.
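Here is a rough sketch of that decorator usage, based on the alpha API described in the talk; exact function and argument names may differ in current mltrace releases, so treat this as illustrative rather than definitive.

```python
from mltrace import create_component, register

# Register the component once, with a description, an owner, and optional tags.
create_component(
    name="cleaning",
    description="Removes out-of-month rides and zero-dollar fares",
    owner="shreya",
    tags=["etl"],
)

@register(component_name="cleaning",
          input_vars=["raw_data_filename"],
          output_vars=["output_path"])
def clean_data(raw_data_filename: str) -> str:
    # The decorator logs the values of raw_data_filename and output_path
    # for this run and resolves dependencies on other components via I/O.
    output_path = raw_data_filename.replace("raw", "clean")
    # ... actual cleaning logic would go here ...
    return output_path
```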
And let me quickly jump to the UI demo on how to view a trace. So, this is the UI, and I’m serving it via an EC2 instance. You can see some of the commands in the Help here, but what we want to do is trace an output, for example. So that’ll run this trace on an output ID. Let me move my Zoom panels so I can explain it. As you can see, these pipelines are already pretty complex; this is the ID of an output from the Flask API that we were just serving. And we can see that there are so many steps involved in generating this output, right?
There’s the inference step, and you can see a code snapshot; there’s also a Git commit, so all of that exists. But that depends on the feature generation for the data that was fed in, right? And it also depends on the model that was trained, which is dependent on another feature generation stage, which is dependent on another cleaning stage. And the model is dependent on the parameters from the training run. You can see it gets very complex, even when we’re running such a simple toy pipeline.
And I think that’s the bottom line of why I built this tool, because when I was at my previous company, I could not easily find like, “All right, what was the clean data that was fed in to the Flask app?” No idea, especially if I didn’t build it. But using this tool, you can pretty much immediately find out the file to go look at.
And so that’s what makes me really excited about it. And you can check out the docs at the GitHub repo. It’s fully open sourced. Yeah, let me go back to the slides.
So, moving forward, on the immediate roadmap, I’m planning to add something to the UI to very easily see if a component is stale. There are two types of staleness that I think are immediately viewable. One, there’s a newer version of that component’s outputs; maybe the train component had a new component run, for example. Or two, the latest run of a component happened weeks or months ago; like, the model was most recently trained three months ago. It would be nice to update that, but a lot of times, if you’re working on a team, you might not have the resources to immediately know that a model is stale.
At my previous company, there was a model I built that I had not updated in four months, and nobody had any idea, and I was the one who built it. So, I think it’s super easy for these kinds of things to slip people’s minds. I’m also working on a CLI to interact with the logs; instead of just using the UI, it would be nice to use the CLI. Right now, it’s very Pythonic, but Scala and Spark integrations are on the roadmap, essentially by having some sort of REST API for logging. So even if you don’t use Python for every step in the pipeline, you can pretty much still use mltrace. It’s also on the roadmap to integrate Prometheus or some other tool to easily monitor the output distributions.
So, for example, if you’re already logging your outputs to mltrace, it should be super straightforward to capture those values and to show you the distributions in a reasonable way. I’d love to also work on other MLOps tool integrations, specifically around monitoring or specifically around version control. So, e-mail me if you’re interested in contributing. I think there’s a lot of stuff to work on.
So, to recap: in this talk, I introduced an end-to-end ML pipeline for a toy task. I demonstrated performance drift, or degradation when a model is deployed, via Prometheus metric logging and Grafana dashboard visualization. We showed via statistical tests that it’s hard to know exactly when to retrain a model, motivating that you should just try to retrain your model as often as your infra can possibly handle, and that there are challenges around infra, around continuously retraining models and maintaining these production pipelines.
Finally, this motivated my own tracing and lineage tool, mltrace, a tool developed for ML pipelines that performs coarse-grained lineage and tracing. And I know I demoed just the tracing, but there’s also a history command to show you all of the component runs for a specific component. So, I encourage you to check it out. Please reach out to me. I’m pretty excited about it.
Some areas of future work in the field: I think there are a lot, but I’ll list a few here. I really only scratched the surface with some of these challenges, and I think there are way more challenges around deep learning specifically. For example, if you use embeddings or the outputs of deep learning models as features, and somebody upstream updates the embedding model, do all of the downstream data scientists need to immediately change their models? Something like mltrace could use further work, or applications built on top of it, to alert or notify people who have dependency issues.
Another very interesting phenomenon is underspecified or underdetermined systems. Underspecified pipelines can pose threats: for example, even if your train and test metrics match offline, when you deploy into the real world you can still experience performance degradation due to some interesting linear algebra phenomena.
And also, how do you enable anyone to build a dashboard or monitor pipelines? I showed a Prometheus and Grafana integration. For ML, the ML people know what to monitor and the infra people know how to monitor, so it’s really annoying when you need everybody in the room to be able to build this specific tooling. It’s hard for people to be self-sufficient, and we should be striving to build tools that allow engineers and data scientists to be more self-sufficient.
So that’s really all I have for today. The code for this talk is at debugging-ml-talk. A version of the slides will be posted there, and you can see all the repositories that I’ve used. The toy ML pipeline is here, on a specific branch, and it has all the Prometheus and Grafana instructions on how to run it if you’re interested. mltrace, specifically, is here, actively under development, which is exciting.
So, if you have any questions or want to reach out, you can reach out to me on Twitter or you can e-mail me at shreyashankar@berkeley.edu. I’m starting my PhD in the fall, working specifically on these kinds of tools. So, I’m super excited. And with that, thank you so much, and I hope you have a great rest of the conference.

Shreya Shankar

Shreya is a computer scientist living in San Francisco interested in making machine learning work in the “real world.” She is an incoming PhD student at UC Berkeley. Previously, she was the first...