
The costs of fraud are staggering. In 2022, just one type of fraud, card-not-present fraud, resulted in almost $6 billion in losses in the U.S. alone. According to the Federal Trade Commission, the top five fraud categories in the U.S. are:

  1. Imposters
  2. Online shopping
  3. Prizes, sweepstakes, lotteries
  4. Investments
  5. Business and job opportunities

Many businesses have already begun to use AI to automate real-time fraud prevention and detection at scale. But this is a cat-and-mouse game where fraudsters continuously concoct new ways to sneak past detection. To stay ahead of them, AI models need to constantly evolve and take in the freshest data as inputs, making feature freshness and model development speed vital to success.

In this blog, we’ll introduce some key ways you can leverage Tecton on Databricks to build a real-time fraud detection system, with working examples at the end.

Scaling the ML Feature Pipeline

Fraud is especially prevalent within vast, high-volume networks (think thousands of transactions per second). To catch fraud in these networks, companies need reliable and scalable storage and compute. The Databricks Data Intelligence Platform is an excellent option, especially since Delta Lake is used by 10,000+ companies to collectively process exabytes of data per day. On the ML model side, capabilities such as MLflow provide MLOps at scale. Databricks Model Serving exposes your MLflow models as scalable REST API endpoints, providing a highly available, low-latency service that automatically scales up or down with demand, saving infrastructure costs while optimizing latency. Databricks provides a secure environment for reliable storage, compute, model deployment, and monitoring.

Since its inception in 2019, Tecton has partnered with Databricks to supercharge its capabilities for real-time machine learning at production scale by solving the core challenge: real-time feature data pipelines. Tecton manages features-as-code and automates the end-to-end ML feature pipeline, from transformation and online serving to monitoring across batch, streaming and real-time data sources. The overall pipeline is built on Databricks compute and Delta Lake.

With Tecton and Databricks, data teams can maximize time to value for their ML models, ensure model accuracy and reliability in production, control costs, and future-proof their ML stack.

Use Tecton on Databricks for real-time fraud detection

Unlocking batch, streaming and real-time ML features

The fresher the data inputs, the more likely you are to detect fraudulent behavior. Databricks keeps data in massively scalable cloud object storage using open source data standards, while access to your sensitive fraud data is governed by Databricks Unity Catalog.

Tecton leverages the flexibility of the Lakehouse to compute features on massive fraud datasets. Taking credit card fraud as an example, Tecton on Databricks makes it very easy to infuse the latest data signals into your ML features. You may want to know how many transactions a customer completed in the last hour, day, and week. You can easily create these windowed aggregations with a few lines of code. Additionally, on-demand features can calculate a feature just-in-time with data provided at the time of inference, such as determining whether a current transaction is larger or smaller than the average threshold over a time window.
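To make the two feature types concrete, here is a minimal sketch in pandas of what such features compute. The DataFrame, column names, and thresholds are hypothetical, and this is illustrative logic only, not Tecton's API; in Tecton these would be declared as feature views and on-demand features.

```python
import pandas as pd

# Hypothetical transaction log; column names are illustrative.
txns = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "timestamp": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-01 10:30",
        "2023-01-02 09:00", "2023-01-01 12:00",
    ]),
    "amount": [25.0, 300.0, 40.0, 15.0],
})

def windowed_counts(df: pd.DataFrame, window: str) -> pd.Series:
    """Transactions per user within a trailing time window (e.g. '1h', '1D', '7D')."""
    return (
        df.set_index("timestamp")
          .sort_index()
          .groupby("user_id")["amount"]
          .rolling(window)          # time-based trailing window
          .count()
    )

def amount_above_window_average(current_amount: float, window_avg: float) -> bool:
    """On-demand-style feature: computed just-in-time from request data at inference."""
    return current_amount > window_avg

# Trailing 24-hour transaction count per user, keyed by (user_id, timestamp).
counts_24h = windowed_counts(txns, "1D")
```

In production, Tecton would backfill and continuously maintain equivalent windowed aggregations from batch and streaming sources, while the on-demand comparison runs at request time against the incoming transaction.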

Deploying your ML features to production

Imagine that your data scientists have developed a few new features for your fraud detection model and you want to use them in production. With your features defined in Tecton, you can push them to production in one click. Tecton handles ingesting the latest raw data, transforming it into features on a schedule you determine, making those features readily available for training and serving, and monitoring feature performance in production. Tecton also optimizes the computation and storage of features to maximize cost-efficient performance. Under the hood, Tecton leverages data sources like Delta Lake and Databricks compute.

Real-time inference at scale

Real-time inference is critical to catching fraud before more transactions can occur. Considering that credit card fraud alone causes more than $11 billion in losses in the U.S. each year, you want to catch fraud the moment it actually happens. According to security.org, even the simple act of providing a timely fraud alert allowed customers to catch fraud in their own accounts within minutes or hours, rather than days or weeks.

To stay ahead of fraudsters, you want to make sure that your fraud detection model can make decisions at lightning speed, even during high-transaction periods (such as during the holidays). Databricks’ real-time model serving deploys ML models as a REST API, allowing you to build real-time ML applications without the hassle of managing serving infrastructure.
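As a rough sketch of what calling such an endpoint looks like, the snippet below builds a scoring request. The workspace URL, endpoint name, and feature names are hypothetical placeholders; the `dataframe_records` payload shape follows the Model Serving invocation format, but consult the Databricks documentation for your endpoint's exact schema.

```python
import json

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT = "fraud-detection"  # hypothetical endpoint name

def build_invocation(records: list[dict]) -> tuple[str, str]:
    """Return the invocation URL and JSON body for a scoring request."""
    url = f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT}/invocations"
    body = json.dumps({"dataframe_records": records})
    return url, body

url, body = build_invocation([
    {"amount": 512.30, "txn_count_1h": 7, "txn_count_24h": 23}
])
# An actual call would POST this with an auth header, e.g.:
#   requests.post(url, data=body,
#                 headers={"Authorization": f"Bearer {token}",
#                          "Content-Type": "application/json"})
```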

Tecton seamlessly integrates with Databricks’ real-time model serving and provides a secure REST API for Databricks to get real-time features from the online store. Tecton itself uses enterprise security best-practices and is SOC 2 Type 2 Compliant.
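A feature lookup against Tecton's online store might be shaped like the sketch below. The cluster URL, workspace, feature service, and key names are all hypothetical; see the Tecton HTTP API docs for the exact request format your deployment expects.

```python
import json

TECTON_URL = "https://<your-cluster>.tecton.ai/api/v1/feature-service/get-features"  # placeholder

def build_feature_request(user_id: str, amount: float) -> str:
    """JSON body asking the online store for one user's fraud features."""
    return json.dumps({
        "params": {
            "workspace_name": "prod",                           # hypothetical
            "feature_service_name": "fraud_detection_service",  # hypothetical
            # Entity keys identify whose precomputed features to fetch.
            "join_key_map": {"user_id": user_id},
            # Request-time context feeds on-demand features at inference time.
            "request_context_map": {"amount": amount},
        }
    })

req_body = build_feature_request("u123", 512.30)
# The returned feature vector would then be passed to the model serving
# endpoint for scoring.
```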

Example architecture for fraud detection with Databricks and Tecton

Scaling to multiple ML models in production

With MLflow Model Registry and Model Serving on Databricks, teams can easily iterate on multiple models and promote the best candidates to production. Tecton makes it easy to manage the features delivered to any of these models, as well as monitor uptime and query performance in the online store. Because Tecton utilizes a declarative, features-as-code approach to feature generation, users can easily modify and extend existing features to meet the needs of the next model iteration.

Easily monitor activity and uptime for your online feature store in the Tecton Web UI

Interested in learning more about how to use Tecton on Databricks? Check out the Tecton docs or email [email protected].

For a sample notebook that demonstrates how to develop features and train a model for real-time fraud detection in Databricks, visit the GitHub repository or view the sample notebook.
