Advanced Model Comparison and Automated Deployment Using MLflow

May 26, 2021 05:00 PM (PT)


Here at T-Mobile when a new account is opened, there are fraud checks that occur both pre- and post-activation. Fraud that is missed has a tendency of falling into first payment default, looking like a delinquent new account. The objective of this project was to investigate newly created accounts headed towards delinquency to find additional fraud.

For the longevity of this project, we wanted to implement it as an end-to-end automated solution for building and productionizing models that included multiple modeling techniques and hyperparameter tuning.

We wanted to utilize MLflow for model comparison, graduation to production, and parallel hyperparameter tuning using Hyperopt. To achieve this goal, we created multiple machine learning notebooks where a variety of models could be tuned with their specific parameters. These models were saved into a training MLflow experiment, after which the best performing model from each model notebook was saved to a model comparison MLflow experiment.

In the second experiment the newly built models would be compared with each other as well as the models currently and previously in production. After the best performing model was identified it was then saved to the MLflow Model Registry to be graduated to production.

We were able to execute the multiple-notebook solution above as part of an Azure Data Factory pipeline that is regularly scheduled, making the model building and selection a completely hands-off implementation.

Every data science project has its nuances; the key is to leverage available tools in a customized approach that fits your needs. We hope to provide the audience with a view into our advanced and custom approach of utilizing the MLflow infrastructure and leveraging these tools through automation.

In this session watch:
Charu Kalra, Senior Data Scientist, T-Mobile
Connor McCambridge, Senior Data Scientist, T-Mobile

 

Transcript

Charu Kalra: Hi, my name is Charu Kalra and I’m a data scientist here at T-Mobile.

Connor McCambri…: And I’m Connor McCambridge, also a senior data scientist at T-Mobile. And we’re excited to present to you our presentation for advanced model comparison and automated deployment using MLflow, here at the 2021 Data and AI Summit. The reason we selected this topic is that in the past we’ve seen presentations and demos highlight some of the key features of MLflow Model Registry. We thought it’d be beneficial to show an example of how our data science team is using those tools in production to fit our unique needs. We hope this will give you a better understanding of how these tools can be integrated with your own data science solution.

Charu Kalra: Connor and I are part of the fraud analytics and insight team here at T-Mobile. We as a team, bring our unique skills and expertise to solve complex business problems, using statistical analysis and data science techniques. Previously at Sprint, some of the capabilities that we leveraged were real-time decisioning, model score carding, and advanced automation using open source technologies like Spark and Python. As we merged with T-Mobile and founded a new data science team, we decided to make the jump and move into the Azure Cloud environment and utilize Databricks to implement our own data science vision. But we would make up only two thirds of this team. And we would be remiss if we didn’t mention our third team member, Ted.
During this presentation, we’re going to take you through the first problem we tackled as a team using Azure capabilities, which was to identify fraud inside the delinquent new account status. First, we will walk you through the project vision and how we began to address the model building aspects of the project. Then we will show you how we were able to achieve all of our model building needs leveraging MLflow Model Registry and Azure Data Factory through our solution design. We will give you a preview of our demo, of how our model comparison is implemented, and how we were able to productionize the models to automation. Then we will conclude this talk with some high level results of the project and key takeaways of what we have learned from working in the Azure environment. But before we get started, we want to take this moment to thank you all for attending this session. Your feedback is much appreciated so please don’t forget to rate and review our session.
Like we stated previously, our first project utilizing Azure was to identify fraud hidden inside the pool of delinquent new accounts. To give you an idea of what this problem looks like, when a typical account is created, a customer provides the required information, receives their selected device, after which the device is activated and used. They receive their first bill, a payment is made, and then they continue using their service. During this process, there are various fraud checks that occur both pre- and post-activation, but if the first payment is missed, the new account falls into a delinquent status. Fraud that has been missed by previous checks has a tendency of falling into the same delinquent status.
With all of this in mind, the objective of our project was twofold. First, to quantify the effectiveness of our existing fraud measures by estimating the amount of fraud that is in this delinquent status. Secondly, to add another checkpoint to identify fraud that escaped the previous examinations. In order to investigate the accounts headed toward delinquent status, we decided to take three unique approaches to the problem. Random sample: by selecting accounts at random for review, we can make statistical projections to measure the effectiveness of our upstream fraud controls. Machine learning: this supervised approach uses past instances of fraud to build predictive models that identify the probability of fraud. Outlier detection: this unsupervised approach identifies suspicious accounts that fall outside the normal thresholds of standard accounts, possibly finding new and emerging fraud trends. The random sample is completed by selecting accounts at random, but for machine learning and outlier detection, we wanted both of these models to be continuously updated, retrained, and re-tuned with new information as it was made available through an automated process flow.
Even though this was our first project, we wanted to be mindful about the amount of effort it would take to support these projects long-term. It was clear from the beginning that it was crucial to have the right process flow in place to support the advanced automation of our model building deployment and daily score carding activities. Here’s a high level overview of how this automated process flow works. It starts with collecting historical data, which is used to build models. The best model is moved into production. From there, daily data is gathered, processed through the production model. Model results are collected. Selected accounts are sent for investigation. Then those accounts are manually reviewed for fraud. Once the investigation is complete, the final results can be used to examine model performance. The identified fraud will then be included in historical data, which would improve future predictions and model training.
The model building step of this process flow entails these data science stages. It starts by combining the relevant data from various sources and then splitting it between training and test, then transforming the data into a more consumable format for our mathematical models. After that, machine learning and outlier detection models are built using a wide variety of parameters and modeling techniques. The models are examined based on performance metrics, and if a new model outperforms the current production model on the same test data, then the new model is productionized. These stages are then [inaudible] into Databricks notebooks to complete our model building process.
As part of notebook building, we first tried to combine all four data science stages into one notebook. The challenge here was the inability to train multiple models. One way to do that was by replicating the notebook and training different models in each. There were a couple of problems with this approach. It was hard to ensure that the same data was being used for training and testing across all models. We would have to process the same data again and again for each notebook. There was no unified selection of the best performing model, so any model could be implemented and moved to production. And lastly, if there was a change, then each notebook would have to be updated individually.
Another way to train models was to add multiple model building techniques into one notebook. This too had its issues. Even though it reduces the steps before and after model building, the models would not train simultaneously, causing long run times and not using the full capacity of the cloud infrastructure. And if a single model needed to be tweaked, then all the models would have to be rerun for comparison.
To solve for these issues in the previous approaches, we decided to move away from a single notebook solution to a framework of notebooks instead. With this method, the data preparation and transformation take place only once, ahead of the various model building techniques. Model building could be done independently and run simultaneously, giving us the ability to use the full capacity of the platform. If any model had to be reprocessed, we would just rerun that particular model building notebook without impacting the other models. This allowed for horizontal scaling for additional models. Then the models are compared and registered at a single point to be implemented in production. This entire framework would be automated using Azure Data Factory, making it a fully hands-off solution. Now I would like to pass it on to my colleague Connor, to take you through the framework components.

Connor McCambri…: Thank you, Charu. In order to successfully execute this newly designed framework, we determined that these components were critical for the longevity of our project. We had to create and store our training and testing data, build and utilize a uniform transformer for feature engineering, train multiple models efficiently and in parallel, hyper-tune a variety of parameters across various models, compare and select the best performing model, and automate this entire process from end to end for a seamless deployment.
These components came together to make up this fully automated solution design, which we've implemented here in production at T-Mobile on the Azure platform. It starts with the data preparation and transformation notebook. With the data and transformers saved, we then start up our two model building lanes, one for machine learning and the other for outlier detection. In each of these lanes, we built multiple models, compared results, and saved the best performing model into production. Production models are then utilized in our daily batch scoring notebook to identify suspected fraudulent accounts. Now we will deep dive into each of these framework components.
Starting with creating and storing the data. Here we built a notebook to gather data from various sources, split the data into training and testing sets, and store this data into Delta Lake tables. These Delta Lake tables can then be utilized by all downstream notebooks. By storing the data in Delta Lake, we keep the full history of the data utilized in each transformer and model creation, giving us the ability to create models at any snapshot in time.
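A minimal sketch of this step (not the exact project code), assuming a PySpark notebook on Databricks where spark is predefined; the source table, output table names, and split ratio are placeholders:

    # Minimal sketch: split the source data once and persist it as Delta tables so
    # every downstream notebook trains and tests on the same snapshot.
    raw_df = spark.table("fraud.new_account_history")  # placeholder source table

    train_df, test_df = raw_df.randomSplit([0.8, 0.2], seed=42)

    train_df.write.format("delta").mode("overwrite").saveAsTable("fraud.training_data")
    test_df.write.format("delta").mode("overwrite").saveAsTable("fraud.testing_data")

    # Delta time travel makes it possible to rebuild a model from any earlier snapshot
    historical_train = spark.sql("SELECT * FROM fraud.training_data VERSION AS OF 3")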
Once the persistent training and testing data is saved, we build the uniform transformer. By doing this, we build the transformer once and it can be used by all the models, saving us processing time. This uniform transformer is built to support any data changes upstream, as well as the data going into the models. Saving the transformer using the MLflow Model Registry makes it easily callable in all subsequent notebooks. With the data and transformer created, we then run the notebooks for training the multiple machine learning and outlier detection models. As stated before, the ability to run models simultaneously allows us to build a catalog of models utilizing our various modeling techniques. Some machine learning models we've explored are logistic regression, neural network, XGBoost, random forest, histogram-based gradient boosted tree classifier, AdaBoost, support vector classifier, and linear discriminant analysis. And some of the outlier detection models we have explored are Elliptic Envelope, Isolation Forest, and [inaudible].
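A minimal sketch of the transformer registration described above, assuming a scikit-learn transformer and a pandas frame train_pdf read from the training Delta table; the feature columns and registered model name are illustrative placeholders:

    # Minimal sketch: fit one feature-engineering transformer, log it to MLflow,
    # and register it so downstream model notebooks can load the same object by name.
    import mlflow
    import mlflow.sklearn
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical feature columns
    transformer = ColumnTransformer([
        ("num", StandardScaler(), ["device_price", "plan_charge"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sales_channel", "plan_type"]),
    ])
    transformer.fit(train_pdf)

    with mlflow.start_run(run_name="uniform_transformer"):
        mlflow.sklearn.log_model(
            transformer,
            artifact_path="transformer",
            registered_model_name="fraud_uniform_transformer",  # placeholder registry name
        )

    # Any subsequent notebook can then pull the identical transformer by version
    transformer = mlflow.sklearn.load_model("models:/fraud_uniform_transformer/1")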
For each of these models, we wanted a way to explore a variety of different tuning parameters and save the best performing one. Instead of using the typical grid searching method, we wanted a way to explore a wide range of parameters without having to define each and every value. For this, we decided to utilize a hyperparameter tuning library named Hyperopt. The Databricks ML Runtime comes preloaded with an optimized version of Hyperopt with Spark integration.
The way that Hyperopt works inside the training notebook is by retrieving the training and testing data and using the uniform transformer to feature engineer it. We define the models and parameters to be optimized, leveraging a loss minimizing function. The models are built using initial parameters on the training data, and the loss value of the scored data is returned to the loss minimizing function. This function then uses these results to select the next set of parameters to minimize loss. This process continues for a defined number of iterations, after which the variously tuned models are compared. Based on the comparison metrics, the best performing model is selected. The transformer and selected model are combined into a single pipeline and saved for future comparison. This process is repeated for all models examined, for both machine learning and outlier detection.
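A minimal sketch of the loop just described, assuming the transformer and pandas train/test data (train_pdf, train_labels, test_pdf, test_labels) are already loaded; the model family, search space, and evaluation count are illustrative rather than the exact ones used in the project:

    # Minimal sketch of Hyperopt tuning one model family; the loss-minimizing
    # function (fmin with TPE) uses prior results to pick the next parameter set.
    import mlflow
    import mlflow.sklearn
    from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.pipeline import Pipeline

    # Illustrative search space
    space = {
        "n_estimators": hp.quniform("n_estimators", 100, 500, 50),
        "max_depth": hp.quniform("max_depth", 3, 12, 1),
    }

    def objective(params):
        model = RandomForestClassifier(
            n_estimators=int(params["n_estimators"]),
            max_depth=int(params["max_depth"]),
        )
        model.fit(transformer.transform(train_pdf), train_labels)
        preds = model.predict(transformer.transform(test_pdf))
        # Hyperopt minimizes loss, so return the negative of the metric we want to maximize
        return {"loss": -f1_score(test_labels, preds), "status": STATUS_OK}

    best = fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=50,
        trials=SparkTrials(parallelism=8),  # distributes trials across the cluster
    )

    # Re-fit the winning parameters, combine transformer + model into one pipeline, and log it
    best_model = RandomForestClassifier(
        n_estimators=int(best["n_estimators"]),
        max_depth=int(best["max_depth"]),
    ).fit(transformer.transform(train_pdf), train_labels)

    with mlflow.start_run(run_name="random_forest_best"):
        pipeline = Pipeline([("transform", transformer), ("model", best_model)])
        mlflow.sklearn.log_model(pipeline, artifact_path="model")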
Once the various models are appropriately tuned and saved to the MLflow experiment, we begin the model selection process. This notebook is a single point to compare a multitude of models while simplifying the process of launching a model into production. It allows us not only to compare the new models, but also to compare them with previous versions of the production models on updated testing data. With the best model selected from the comparison, we then leverage the Model Registry to automate the deployment of the production models. If the selected model is new, we save that model into the registry. If the model selected was previously in production, we identify that previous version. The best model is moved to production, and the former version is archived. The production models are then used for the daily scoring job on new data.
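A minimal sketch of the graduation step using the MLflow registry APIs; best_run_id would come from the comparison notebook, and the registered model name is a placeholder:

    # Minimal sketch: register the winning run's model and promote it, archiving
    # whatever version was in Production before.
    import mlflow
    from mlflow.tracking import MlflowClient

    model_name = "fraud_ml_model"  # placeholder registered model name
    client = MlflowClient()

    # best_run_id is the run chosen by the comparison step
    new_version = mlflow.register_model(f"runs:/{best_run_id}/model", model_name)

    client.transition_model_version_stage(
        name=model_name,
        version=new_version.version,
        stage="Production",
        archive_existing_versions=True,  # former production version moves to Archived
    )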
To successfully execute the notebook framework we have discussed, we leveraged Data Factory to orchestrate our complex workflow. Data Factory makes it easy to work with Databricks through a dedicated linked service, the ability to select clusters for execution, and the functions built specifically for running Databricks notebooks. With Data Factory pipelines, we were able to configure scheduling, retry attempts, and robust logging and notification to track this automated solution, and we had the ability to adjust the architecture of the pipelines based on the sizing and availability of the Databricks clusters. For example, the design that you're currently seeing builds four models simultaneously. Through Data Factory pipelines, designing and implementing the various Databricks notebooks is extremely simple and straightforward.
To give you a better idea of how our model selection and graduation process works, we've created a demo to walk you through the model comparison notebook and its functionality, in which we will compare the model that's in production with a variety of new models utilizing a testing dataset, programmatically select the best performing model, and productionize the model automatically using MLflow APIs. The dataset we used for this demo is the famous credit card fraud detection data of European credit card transactions captured over two days in September 2013. The dataset was made available through Worldline and the Machine Learning Group at the Free University of Brussels. Like most data surrounding fraud, this is an extremely unbalanced dataset, which is representative of a lot of the problems that we face in our actual data landscape.
We’re going to start here at our Model Registry. And as you can see, we currently have a version one in production here for our ML model for this demo. This model ties back to this MLflow experiment, where we can see we have that current model, which ties to the one in production. And so now we want to build some additional models to compare against that production model. So we’re going to go ahead and load some additional models. These models were trained beforehand, but we did this in the exact same fashion as our solution design by first building the data transformer, building the models independently, and now we’re comparing those in one single notebook.
So now that we’ve loaded these additional models, we can go back to the MLflow experiment. We can see that we have five models now to compare the one in production and the four additional models. Next we want to load all necessary packages to complete these tasks. And we’re going to load our testing data. This testing data is the same data that was set up when we were training our transformer. We’re going to load our MLflow experiment and we’re just going to run through the search runs just to make sure that we’re getting all the necessary runs. As you can see by scrolling over, we’re getting all those same classification types as the ones we saw in our MLflow experiment.
So next we’re going to quickly build our ROC plotting function. And then we’re going to build our comparison function. Let me go ahead and kick off running it and then I’ll walk back through what exactly we’re doing here. So we’re just going through that MLflow experiment one by one and taking each run, pulling the model back, running the test data through it, getting the test probability and test predictions, scoring those through various metrics, loading that data into a Pandas DataFrame, and then also using that plot ROC function we just built and plotting those at the same time. Just kind of doing this at the same time so we don’t have to keep on iterating through those models again and again.
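A minimal sketch of a comparison function along these lines, assuming the held-out features and labels (X_test, y_test) are already loaded; the experiment ID and metric set are illustrative:

    # Minimal sketch: iterate every run in the experiment, score its model on the
    # shared test set, and collect metrics into a pandas DataFrame for ranking.
    import mlflow
    import mlflow.pyfunc
    import pandas as pd
    from sklearn.metrics import f1_score, precision_score, recall_score

    runs = mlflow.search_runs(experiment_ids=["1234"])  # placeholder experiment ID

    results = []
    for run_id in runs["run_id"]:
        model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
        preds = model.predict(X_test)
        results.append({
            "run_id": run_id,
            "f1": f1_score(y_test, preds),
            "precision": precision_score(y_test, preds),
            "recall": recall_score(y_test, preds),
        })

    results_df = pd.DataFrame(results).sort_values("f1", ascending=False)
    best_run_id = results_df.iloc[0]["run_id"]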
So now we can look at our ROC plot and see that we're doing really well. All of these ROC curves are very close together. So close, in fact, that I zoomed in on this first square; what you're seeing here is that uppermost quadrant. But besides just plotting those, we also return a results data frame. Since the curves are so close, we're going to look at the results data frame sorted by F1 score. And you can see clearly that there's a new model outperforming the logistic regression that's currently in production, and that's the histogram gradient boosted classifier.
So now that we know there's a new best run, let's go ahead and define that best run into a variable. And then we're going to use this variable to register this model through our Model Registry. So the first thing we're going to do is check to make sure our model is registered in the registry, which it is. And we're going to pull back the current production run and the current production version. Next, just to double-check, we're going to compare our new best model to the model that's currently in production. I know we've already compared against the logistic regression that's in the MLflow experiment, but we're going to take another check just in case in the future you want to remove those experiments for various reasons, like the logic has changed or you're adding different columns.
So now we can look at these and see that our new best version is outperforming the one currently in production. We're getting 77 correct predictions here versus 75 correct predictions. One of the biggest things we're improving on is the number of false positives. We had 23 here with our previous production model version; we have 13 here with our new version. So now that we can see this model is outperforming the current model in production, let's go ahead and graduate it to production.
So the first thing we want to check is that this run doesn't match any other runs that are in production or have been in production. So we're going to iterate through the model versions in the MLflow Model Registry, and we see that there are no versions that currently match it. Since there are no versions that match it, we're going to go ahead and save that model down to the Model Registry. We're doing all of this with logic checks: making sure it's not a production run, and making sure the model version is None so that we don't save down any models that have already been saved; and if the model version is equal to the production version, there's nothing that we really need to do.
Now that this model is saved, we're going to go ahead and stage the models. So we're going to take our previous production version and archive it, and we're going to take this new best version and promote it to production. After we do that, we can go back and look at our Model Registry. And we can see that we do in fact have a new model currently in production, this version two, which lines up with the histogram gradient boosted classifier. And lastly, we're just going to remove any unneeded runs from our model experiment. So we're going to go through the model runs, set the ones we're going to delete, and delete any runs that aren't registered. And here we are left with two models, both of them registered. So the next time we run additional models and compare them against each other, we can compare against the models that are currently and were previously in production.
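A minimal sketch of that cleanup, reusing the placeholder model name and the runs DataFrame from the comparison sketch above; any run whose model was never registered is deleted:

    # Minimal sketch: delete experiment runs whose models were never registered,
    # leaving only the current and former production models for future comparisons.
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    registered_run_ids = {
        mv.run_id for mv in client.search_model_versions("name = 'fraud_ml_model'")
    }

    for run_id in runs["run_id"]:
        if run_id not in registered_run_ids:
            client.delete_run(run_id)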

Charu Kalra: Once we were able to complete our final process design utilizing Databricks and Azure Data Factory, we were able to implement all three approaches to aid the identification of fraud inside the delinquent new accounts, and we were able to achieve both of our outlined objectives. By measuring fraud using random sampling, we found a fraud rate three times higher than what was reported by previous processes. Through our data science solution design, we were able to capture additional fraud through the automated deployment of outlier detection models, with which we realized a detection rate four times higher than the random sample rate. And through machine learning, we achieved a detection rate 10 times higher than the random sample rate.
We really enjoyed working through this problem and were able to learn a lot from bringing the solution to production here at T-Mobile. Here are some of the key takeaways that we have learned. For this project, we were able to build an all-encompassing solution within Azure by leveraging its robust suite of tools in a unique approach that fits our needs. Databricks provided us the platform to design a customized solution for our requirements. Model registration, productionization, and deployment were a seamless process utilizing MLflow. The entire solution design was able to be fully automated through Azure Data Factory. Every data science project has its nuances. The key is to leverage the available tools in a customized approach that fits your needs.
Thank you so much for attending our session today. And we hope that this glimpse into our project has been informative and that this encourages you to build your own customized solution. Please feel free to reach out with any additional questions you may have and hope that you enjoy the rest of Data and AI Summit. Thank you.

Charu Kalra

Charu Kalra is a Senior Data Scientist at T-Mobile, where she employs the latest data engineering and machine learning techniques to identify and reduce fraud. She graduated from the Rutgers Universit...

Connor McCambridge

Connor McCambridge is a Senior Data Scientist at T-Mobile on the Fraud Reporting & Analytics Team where he leverages technology and machine learning to better understand and prevent fraud. Previously,...