For credit card companies, illegitimate card usage is a serious problem, which creates a need to accurately distinguish fraudulent transactions from non-fraudulent ones. All organizations can be hugely impacted by fraud and fraudulent activities, especially those in financial services. The threat can originate internally or externally, and the effects can be devastating, including loss of consumer confidence, incarceration for those involved, and even the downfall of a corporation. Regular fraud prevention measures exist, but they are constantly being put to the test by attackers trying to beat the system.
Fraud detection is the task of predicting whether a card has been used by someone other than the legitimate cardholder. One of the methods to recognize fraudulent card usage is to leverage Machine Learning (ML) models. In order to detect fraudulent transactions more dynamically, one can train ML models on a dataset including credit card transaction information as well as card and demographic information of the account owner. This is the goal of our project, leveraging Databricks.
Badrish Davay: Hello everyone. The topic of this session is fraud detection using machine learning in Databricks, by Badrish and Neil. The agenda goes like this: after a brief introduction, we will dive into use cases, after which we will discuss approaches to detecting fraudulent transactions, with some deep insights on how we use Databricks and MLFlow. At the end, we will have a brief demo session showcasing all the different parts of the equation. Hello everyone, again. Just a brief introduction about us. This is Badrish, and I have my colleague, Neil Allen, along with me. About me: I’m a tech evangelist, I like to explore data trends and predictions, and I have been building large data pipelines and ML platforms for the last six to seven years. Overall, I have been in the industry for about 19 years now. I work closely with other data scientists and make the orchestration of modeling simpler for them. Some of my favorite topics include deep learning, forecasting and predictions for loss forecasting, and detecting fraudulent activities in financial services.
A little bit about Neil. Neil is a big data fanatic who has been working predominantly on big data problems and machine learning use cases. He loves working on a cutting-edge tech stack to explore better and simpler ways to get things done, has been in this field for about eight years, and has worked on large datasets, including advertisement datasets, to find pattern matching. We’d like to thank Maryam. She is one of the key players working closely with us to explore our journey in the Machine Learning space, and particularly this use case, which we are going to discuss today. She has been working in the ML platform and deep learning data science space for about five-plus years, and she is very passionate about solving deep learning problems and classification-style use cases.
Someone has said, “What happens in every economic downturn is that the attacks start to become more successful, so over the next two to three years, I fully expect credit card fraud numbers to increase in a pretty meaningful way”. There is also an article from CNBC, which says, “The United States is the most credit card fraud prone country in the world. COVID-19 is playing a major role in the explosive growth in credit card fraud activities. Experts warn there aren’t enough regulations protecting small businesses from chargebacks caused by fraudulent transactions. Companies such as Visa are looking to technological solutions, including artificial intelligence, to solve credit card fraud”.
So, as you know, this is the topic of discussion today. So what is credit card fraud activity? Whenever the actual credit card holder is not directly involved in the transaction, we consider this event to be fraudulent credit card activity. These fraudulent activities usually take place whenever sensitive information about the credit card, or the physical credit card itself, is stolen or misplaced. These events can result in a lot of financial loss, because either the client has to bear the financial burden or the banks end up paying for the items purchased through fraudulent transactions.
Across all 50 states, there were about 14,000 reports of credit card fraud in the most recent dataset available, from 2019. In order to tackle the problem of fraudulent transactions, we are going to analyze the fraudulent activities themselves to uncover important features for detecting credit card fraud. By uncovering such insights, we can further drill down into analyzing whether any trends are present in fraudulent activities. In order to help us with the analysis, we need to ask six W questions and come up with answers to each of them. Those W questions are what, who, when, where, why, and what if.
For good model development, we need a good orchestration framework, better visualization of trends, a decent ML platform, monitoring for data quality and the health of the systems, the ability to do what-if analysis easily, and ease of collaboration between different team members, which together bring a full suite of model development experience.
What is at stake here? Ease of use, real-time detection, deep analytics and modeling, security, and a notification service. These are the five building blocks at stake for building a good model and seamlessly integrating with any enterprise system to solve a complex Machine Learning use case. Fraudulent activities pose a serious threat to the risk management of any financial services company, and they have really serious consequences. From the perspective of a financial institution, customers end up losing faith in the organization. Furthermore, this can cause a lot of unintended mental and financial distress for the customers. Over the years, despite significant advancements in the credit card fraud risk management techniques adopted, attackers are still able to find loopholes and exploit the system. These days we can utilize state-of-the-art Machine Learning algorithms in order to stay ahead of the attackers and, at the same time, constantly learn new ways a system is being exploited.
All the use cases discussed here are in no way connected to Capital One use cases. These experiments are purely based on personal learning needs and on seeing how we can use Databricks and MLFlow to solve some real-world use cases. The data being used in the demo is also not related to Capital One. We have used publicly available data from kaggle.com to conduct these experiments. So let’s move on to the use cases. The first use case: let’s assume a cardholder uses his card at a gas station in a city, for example New York. The data inside the magnetic strip of the card gets stolen, and within a few minutes the credit card gets used in different locations far away from New York City. This can be considered one of the fraudulent activities. That’s our first use case.
For the second use case, suppose a person buys a shirt online from an unpopular website. The information on the card is used later by the hacker, for a random amount at a random time of the day. For example, depending on the contextual information of the credit card and considering the credit card owner, we can say that if the owner is an office worker with a nine-to-five job, it’s unusual for the credit card to be used at a completely different location on a weekday during office hours.
And here’s the third use case: suppose you are in a grocery store and you lose your card or drop it somewhere. You may not notice it right away, and once your credit card information is stolen, your credit card can be used in other locations. In order not to raise any suspicion, the malicious attacker can keep using the card while making transactions of small amounts. After a while, once it is clear that the regular usage of the credit card did not draw any attention, the malicious attacker can make a large transaction. This can result in a large financial loss, and these kinds of fraudulent activities are very difficult to identify. So it would be useful if the cardholder could be notified of these suspicious activities right away, wouldn’t it? So let’s dig deeper into these use cases and the solutions to solve all three of them.
What is Machine Learning? Machine Learning insights can be applied to the various business use cases we discussed earlier. Machine Learning is an algorithm which learns insights by looking at vast amounts of data. We can harness these insights by asking it to perform a task on unseen data. There are different flavors of Machine Learning, such as supervised learning and unsupervised learning. The main difference is that in supervised learning we provide ground truth to the model, whereas in unsupervised learning the model tries to learn insights without any ground truth. In supervised learning, we can perform two main kinds of tasks: classification and regression. We can break down classification tasks into different categories, such as binary, multi-class, and multi-label classification. In binary classification, we usually have two labels, where we want to predict whether something happened or not, or classify between two categories in our dataset.
Once we have the data, the Machine Learning model training workflow can be broken into four main stages. The first is splitting the data, with stratification, into two sets: train and test. The model will never see the test set in its training lifetime. The second stage consists of finding the best set of hyper-parameters for training the model. We can use several algorithms, such as grid search or Hyperopt, to find the best hyper-parameters. Once we have the best hyper-parameters, we can then train the model using different Python libraries, such as Scikit-learn. In general, we pick the Machine Learning model which best fits the business use case, such as SVM, Decision Tree, and so on. Once we have the trained model, we want to verify its performance on the dataset coming in real time. We ideally want to use business-case-driven metrics in order to validate model performance. We can also use the built-in implementations of performance metrics from the Scikit-learn library in Python.
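The four stages above can be sketched in a few lines of plain Python. This is a minimal, stdlib-only illustration, not the demo's actual code: the "model" is a hypothetical single-threshold classifier on transaction amount, standing in for the SVM or Decision Tree models mentioned in the talk, and the grid search simply tries each candidate threshold on the stratified train split.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.3, seed=0):
    """Split row indices so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - test_frac))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx

def accuracy(threshold, amounts, labels, idx):
    """Predict fraud (1) when amount exceeds threshold; return accuracy on idx."""
    hits = sum((amounts[i] > threshold) == bool(labels[i]) for i in idx)
    return hits / len(idx)

# Toy transaction amounts and fraud labels (1 = fraudulent).
amounts = [5, 12, 9, 300, 250, 7, 400, 11, 280, 6]
labels  = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]

# Stage 1: stratified train/test split.
train_idx, test_idx = stratified_split(labels)

# Stage 2 + 3: grid search over thresholds, "training" on the train split only.
best_t = max([10, 50, 100, 200],
             key=lambda t: accuracy(t, amounts, labels, train_idx))

# Stage 4: evaluate the chosen model on the held-out test split.
print(best_t, accuracy(best_t, amounts, labels, test_idx))
```

In a real pipeline, the threshold model would be replaced by a Scikit-learn estimator and the grid search by `GridSearchCV` or Hyperopt, but the shape of the workflow is the same.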
We can leverage Machine Learning in order to identify fraudulent activity in credit card usage. Given the knowledge that we already have about the fraudulent activities that happened in the past, we can identify a dataset of credit card transactions containing fraudulent and normal activities and use the Machine Learning model best suited for our use case, which is: given transaction characteristics, predict whether a transaction is fraudulent or not. From credit card usage, we can detect fraudulent activities by using transaction and contextual features. Based on our definition of performance metrics, we can then select the best performing model to deploy and serve. Imagine a fraud detection and notification workflow as shown here: a transaction takes place at a time T and location X, and these transaction details, along with contextual information, are sent to a machine learning analytical service to determine whether it’s a fraudulent transaction or not.
According to the results of this analytical model, if it determines the transaction is fraudulent, we immediately notify the credit cardholder that a fraudulent activity has been detected. We expect feedback from the cardholder identifying whether the transaction was a fraudulent activity or not. Given the feedback from the cardholder, the credit card issuing bank would be notified. And if it is not a fraudulent transaction, it goes ahead as usual. We can implement the workflow discussed before easily by utilizing Databricks and MLFlow. We can come up with experiments and collaboratively develop and run those experiments with the team using Databricks. Even though we are sharing experiments and data collaboratively within the team, we can implement stringent security measures in order to respect data privacy. Each experiment can have its own compute environment and requirements. To that end, we can utilize a cluster that suits the experiment’s compute needs. While running different experiments in different clusters, we can track each and every one of those experiments. Furthermore, Databricks is a kind of one-stop shop for all of our data science and model serving needs, and that makes it perfect for data science projects.
Considering that we have identified features for predicting whether a transaction is fraudulent or not, we can pass these data points into a Databricks-hosted environment. Here, we can perform feature engineering and data pre-processing, and split the data into training and test sets. Following this, we can use any flavor of Machine Learning algorithm, such as SVM, Decision Tree, or Random Forest, to train a model. As we mentioned before, we can identify the best performing model by looking at different evaluation metrics, such as the recall and precision performance metrics.
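Recall and precision, the two metrics just named, can be computed directly from predicted and true labels. Here is a small stdlib-only sketch (the label lists are illustrative, not from the talk's dataset; in practice Scikit-learn's `precision_score` and `recall_score` would be used):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = fraudulent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many real?
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real, how many caught?
    return precision, recall

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r)  # 0.75 0.75
```

For fraud detection, recall is often the business-critical metric: a missed fraud (false negative) is usually costlier than a false alarm.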
Once we have the best model as per our definition, we can use Databricks to serve it directly from within the Databricks platform. MLFlow within the Databricks ecosystem is another great feature that we can use, because it has numerous advantages in developing an ML workflow pipeline seamlessly. MLFlow allows us to simply track our ML experiments end-to-end across the Machine Learning model life cycle, that is, tracking the inputs and outputs of a Machine Learning model, the best hyper-parameters, as well as the results obtained while validating the ML model. We can run experiments directly from GitHub without the need to go through the code. We can directly deploy trained models by serializing them, utilizing packages such as Pickle, Spark ML, and so on. After we have the best model, we can register it through the MLFlow dashboard in Databricks. We can then deploy this registered model and serve it as an API by harnessing MLFlow. These are the five steps we defined in the life cycle.
What is a microservice and why is it useful with MLFlow? A microservice is a gateway to a specific functional aspect of an application. It helps us develop applications in a standardized, consistent manner over time. Microservices allow us to deploy pieces of application functionality independently of each other. They help us abstract the functionality, but at the same time open up the ability to build a reusable and uniform way of interacting with an application. Furthermore, they let us compose complex behavior by combining a variety of other microservices together. Essentially, they let us use any tech stack in the backend while maintaining compatibility in the front end. That’s what Databricks provides. This approach is fault tolerant, since we can quickly isolate a point of failure and act accordingly.
Considering that a credit card transaction has occurred, we will send the transaction information and contextual information to the ML inference pipeline. So what is the ML inference pipeline? The ML inference pipeline will determine whether a transaction was actually fraudulent or not. That’s the heart of this whole ecosystem. If it decides that a transaction is fraudulent, we will use a notification microservice to notify the respective client of the suspicious activity. If the client gets back to us confirming that the activity was indeed a fraudulent transaction, we will notify the bank, through a microservice again. Otherwise, we will let the transaction follow its normal procedure.
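The control flow just described can be sketched as plain functions. This is a hypothetical stand-in, assuming a simple rule-based `infer_is_fraud` in place of the real ML inference pipeline, and plain return values in place of the notification and bank microservice calls, so the branching logic is easy to follow:

```python
def infer_is_fraud(transaction):
    """Stand-in for the ML inference pipeline: flag large, distant transactions."""
    return transaction["amount"] > 200 and transaction["distance_km"] > 100

def handle_transaction(transaction, cardholder_confirms_fraud):
    """Route a transaction through the inference + notification flow."""
    if not infer_is_fraud(transaction):
        return "processed"              # not suspicious: normal procedure
    # Suspicious: notify the cardholder and wait for their feedback.
    if cardholder_confirms_fraud(transaction):
        return "bank_notified"          # confirmed fraud: notify the bank
    return "processed"                  # false alarm: proceed as usual

tx = {"amount": 500, "distance_km": 800}
print(handle_transaction(tx, lambda t: True))   # bank_notified
print(handle_transaction(tx, lambda t: False))  # processed
```

In the real architecture, each of these functions would be a separate microservice behind its own API, which is what makes the pieces independently deployable and replaceable.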
Considering we have our raw big data stored in Amazon S3, we can quickly and seamlessly integrate interactions between S3 and our framework through Databricks. By harnessing Databricks, we are able to massively scale Machine Learning model training, validation, and deployment pipelines through MLFlow. We can train and validate models on custom EC2 clusters in AWS and deploy our models through SageMaker directly by using MLFlow APIs. That’s great, isn’t it? Furthermore, through Databricks, we can query and deploy a model, manage the deployment, and clean up the deployment, all while using MLFlow APIs within the AWS ecosystem. In addition, we can ensure security and conditional access via AWS SSO and IAM roles, and that’s what most of our enterprises require today. The tight integration between AWS and Databricks is simple and intuitive to use. Now I will pass on to my colleague, Neil Allen, to show a demo using Databricks and MLFlow.
Neil Allen: Hello everyone, I’m Neil Allen, and I will be showcasing this Python-based notebook, where we will be demonstrating how to use Databricks and MLFlow for the detection of fraudulent credit card activity. We will be using the experimentation features of MLFlow along with some interesting packages, such as Scikit-learn, Pandas, Imbalanced-learn, and a few others. We begin by first getting data from DBFS file storage as a Pandas data frame. This dataset is a public Kaggle dataset, and for the sake of this demo, we’ve changed the names of several columns in order to make them more readable and presentable.
As you can see here, our dataset includes several features, such as time and amount. In addition, we have a class variable; this will be considered the target variable for our exercise. Here we would like to see the unique values of our target variable. As we can see, there are only zeros and ones in our dataset. In our dataset, a zero means that a transaction is not fraudulent, and a one means the transaction is fraudulent. Now let’s see if we have a class imbalance in our dataset. To that end, we plotted the frequency of classes in our dataset. From this plot, we can see we have a highly imbalanced dataset.
This imbalanced dataset makes us highly susceptible to model overfitting. In order to make sure our model is not affected by this imbalanced dataset, we use an undersampling method to make the dataset balanced. Now we can see we’ve achieved a more balanced dataset. Now that we have a balanced dataset to work with, let’s look at some visualizations to understand the correlation between the feature variables and the target. Here we can see some of the features are highly correlated with the target class. For example, as can be seen here, transaction distance can have a highly positive correlation with the fraudulent class. This means that if the locations of two transactions are spatially distant enough, it could be a fraudulent transaction.
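The random undersampling step can be sketched with the standard library alone: the majority class is downsampled to the size of the minority class. The labels below are toy data, and the demo itself uses the Imbalanced-learn package (e.g. its `RandomUnderSampler`) rather than this hand-rolled helper:

```python
import random
from collections import Counter

def undersample(rows, labels, seed=0):
    """Randomly downsample every class to the minority class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    n = min(counts.values())            # minority class size
    keep = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        keep += rng.sample(idx, n)      # pick n rows from each class
    return [rows[i] for i in keep], [labels[i] for i in keep]

rows = list(range(12))
labels = [0] * 10 + [1] * 2             # highly imbalanced: ten 0s, two 1s
_, new_labels = undersample(rows, labels)
print(Counter(new_labels))              # both classes now equally represented
```

Undersampling discards majority-class information, which is acceptable here because the fraud dataset is large; oversampling techniques such as SMOTE are the usual alternative when data is scarce.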
Now, using this correlation matrix, we want to see the correlations between the feature variables as well as the target variable. A more bluish shade means a positive correlation, whereas an orange shade means a negative correlation. Here we can see again that the correlation between transaction distance and class is blue, which means a positive correlation. Next, we have different functions for pre-processing the dataset, for example, imputing values for null column values, normalizing some features, and stratifying the data based on a specific train/test split ratio. Here, we have specified a general configuration for our dataset. Then we applied the configuration to our dataset in order to pre-process the data and get the training and test datasets.
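Two of the pre-processing helpers mentioned, null imputation and feature normalization, can be sketched in a few lines. The function names and values here are illustrative assumptions, not the demo notebook's actual helpers:

```python
def impute_mean(column):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

def min_max_normalize(column):
    """Scale values into the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

amounts = [10.0, None, 30.0, 20.0]
filled = impute_mean(amounts)        # the None becomes the mean, 20.0
print(min_max_normalize(filled))     # [0.0, 0.5, 1.0, 0.5]
```

In the notebook these steps would typically be done with Pandas (`fillna`) and Scikit-learn (`MinMaxScaler`), fitted on the training split only to avoid leaking test-set statistics.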
Now that data pre-processing is done, we want to train our model. We used three different models for our training purposes, and then we will show how MLFlow makes it easy to track the experimentation. In addition, we use Hyperopt, which will help us to automatically find the best hyper-parameters from a predefined search space. First, we defined the hyper-parameter space for our runs. As you can see, we specified three models, SVM, Decision Tree, and Random Forest, along with their different parameters.
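A multi-model search space of this shape might look like the plain dict below. The demo expresses it with Hyperopt's `hp.choice` / `hp.uniform` expressions; the model names and parameter ranges here are illustrative assumptions, not the notebook's exact values:

```python
from itertools import product

search_space = {
    "svm": {
        "C": [0.1, 1.0, 10.0],
        "kernel": ["linear", "rbf"],
    },
    "decision_tree": {
        "max_depth": [3, 5, 10],
        "criterion": ["gini", "entropy"],
    },
    "random_forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [5, 10, None],
    },
}

# Number of candidate configurations if searched exhaustively:
total = sum(
    len(list(product(*params.values()))) for params in search_space.values()
)
print(total)  # 6 + 6 + 9 = 21
```

Hyperopt's value is that it samples this space adaptively (e.g. via Tree-structured Parzen Estimators) instead of enumerating all combinations, which matters once the space grows beyond a handful of parameters.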
Now that we have the configuration of our models, we will use this function to actually call Scikit-learn for training purposes. Here I would like to show how we used the MLFlow library in a Databricks notebook to create experiments, or to reuse existing experiments by their name if we have already begun experimentation. By calling mlflow.start_run, we start tracking the training of models automatically. MLFlow can be used to log the Scikit-learn models. This includes logging the model as an MLFlow artifact, along with the model definition and parameters. An MLFlow experiment is the primary unit for organizing and running a Machine Learning experiment. Any run we do with MLFlow can belong to an experiment. This helps us to visualize, search, and compare runs. Furthermore, we can download artifacts and metadata saved during the run for further analysis, using any tool of our choice.
And now, here in the output, you are seeing a message referencing the fact that we already have an existing experiment by this name. This means that our latest run will be appended alongside our other runs for our purposes later. Now I would like to show an example of the experiments in the MLFlow dashboard. If we click the experiments tab, we will see all the experiments that were run from this notebook. Using this link, we can go to the tracking dashboard directly, and we will be selecting the latest experiment run to explore further. Once we’re in the MLFlow dashboard, we can see the source script and parameters, along with metrics and their corresponding visualizations through the run.
Now, by clicking on training accuracy score metric, we can visualize the accuracy scores for each iteration of the run. We can also add multiple metrics to visualize, compare and analyze the results. If we go back, we can see that there are tags and artifacts that were logged during the run. We can see that MLFlow recorded different artifacts, such as the model Pickle file, which can be downloaded by clicking on the download button. Furthermore, we can see that visualizations such as precision recall curve, ROC curve and confusion matrix images are logged as well. If we want to reproduce the run, it’s extremely easy. We just need to click on the reproduce run button and we can run the same experiment as before. This is important and lets us easily replicate and reproduce the same results each time.
If we go back to the main directory with all the experiments, we can easily select multiple experiments and compare the results for each of these runs. We will be selecting these two experiments to compare. Here you can see we have the parameters, metrics, and plots side by side. Furthermore, we can plot each feature and metric and look at the impact of the feature on the result. For example, we can see how the criterion utilized impacted the training accuracy score. We can see that there are multiple kinds of plots that are auto-generated, and these are fully configurable. For example, here you can see the parameter values shown in the plot; the training accuracy was 94%. And this concludes our notebook demo.
Badrish Davay: Thank you, Neil. This was a very insightful demo on using MLFlow with Databricks, with the data we had from Kaggle. At last, we would request everyone to give us feedback; it’s very important for us. Please don’t forget to rate and review this particular session online. And finally, I would like to thank my team, Neil and Maryam, for all their efforts in working on these use cases and bringing a lot of insights by learning Databricks, and also the Data + AI Summit team for providing this opportunity to share our experiences with the world. Thank you.