%md
# Geospatial fraud detection
*A large scale fraud prevention system is usually a complex ecosystem made of various controls (all with critical SLAs), a mix of traditional rules and AI, and a patchwork of technologies spanning proprietary on-premises systems and open source cloud technologies. In a previous [solution accelerator](https://databricks.com/blog/2021/01/19/combining-rules-based-and-ai-models-to-combat-financial-fraud.html), we addressed the problem of blending rules with AI in a common orchestration layer powered by MLflow. In this series of notebooks centered around geospatial analytics, we demonstrate how the Lakehouse enables organizations to better understand their customers' behaviours, no longer based on who they are, but on how they bank; no longer using a one-size-fits-all rule but a truly personalized AI. After all, identifying abnormal patterns is only possible if one first understands what normal behaviour is, and doing so for millions of customers becomes a challenge that requires both data and AI combined into one platform. As part of this solution, we are releasing a new open source geospatial library, [GEOSCAN](https://github.com/databrickslabs/geoscan), to detect geospatial behaviours at massive scale, track customer patterns over time and detect anomalous card transactions.*
---
+ <a href="https://databricks.com/notebooks/geoscan/00_geofraud_context.html">STAGE0</a>: Home page
+ <a href="https://databricks.com/notebooks/geoscan/01_geofraud_clustering.html">STAGE1</a>: Using a novel approach to geospatial clustering with H3
+ <a href="https://databricks.com/notebooks/geoscan/02_geofraud_fraud.html">STAGE2</a>: Detecting anomalous transactions as ML enriched rules
---
<antoine.amend@databricks.com>
%md
## Context
In the previous [notebook](https://databricks.com/notebooks/geoscan/01_geofraud_clustering.html), we demonstrated how GEOSCAN can help financial services institutions leverage their entire dataset to better understand customers' specific behaviours. In this notebook, we want to use the insights gained earlier to extract anomalous events and bridge the technological gap that exists between analytics and operations environments. More often than not, fraud detection frameworks run outside of an analytics environment due to the combination of data sensitivity (PII), regulatory requirements (PCI/DSS) and model materiality (high SLAs and low latency). For these reasons, we explore here multiple strategies to serve our insights either as a self-contained framework or through an online datastore (such as [redis](https://redis.io/), [mongodb](https://www.mongodb.com/) or [elasticache](https://aws.amazon.com/elasticache/), although many other solutions may be viable).
%pip install pybloomfiltermmap3==0.5.3 h3==3.7.1 folium==0.12.1 mlflow
%md
## Attaching transactional context to geo clusters
As we've trained personalized models for each customer, we can easily understand the type of transactions as well as the days and hours these transactions usually take place. Are these clusters more "active" during working hours or on weekends? Are these transactions mostly about fast food and coffee shops, or do they involve fewer but more expensive purchases? Such a geospatial analytics framework combined with transaction enrichment (future solution accelerator) could reveal valuable information about our customers' spending beyond demographics, moving towards a customer-centric approach to retail banking. Unfortunately, our synthetic dataset does not contain any additional attributes to learn behavioural patterns from. For the purpose of this exercise, we will retrieve our clusters (tiled with H3 polygons as introduced earlier) as-is to detect transactions that happened outside of any known location.
# retrieve the H3 tiles generated from our personalized GEOSCAN models (one set of tiles per user)
tiles = spark.read.table('geospatial.tiles')
display(tiles)
Showing the first 1000 rows.
%md
As the core of our framework relies on open data standards ([RFC7946](https://tools.ietf.org/html/rfc7946)), we could load our models as a simple DataFrame without relying on the GEOSCAN library. We simply have to read the `data` directory of our model output.
# load our personalized models as a standard DataFrame by reading the model's data directory
model_personalized = spark.read.format('parquet').load('/FileStore/antoine.amend@databricks.com/models/geoscan_personalized/data')
display(model_personalized)
Showing all 200 rows.
%md
## Extracting anomalies
Our (simplistic) approach will be to detect whether a transaction was executed in a popular area for each of our customers. Since we have stored and indexed all of our models as H3 tiles, it becomes easy to enrich each transaction with its cluster using a simple JOIN operation (for large scale processing) or a lookup (for real time scoring) instead of complex geospatial queries such as point-in-polygon searches. Although we are using the H3 python API instead of the GEOSCAN library, our generated H3 hexadecimal values are consistent, assuming we select the same resolution we used to generate those tiles (10). For reference, please have a look at the H3 [resolution table](https://h3geo.org/docs/core-library/restable).
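%md
The `to_h3` function used in the cells below is not defined in this extract. A minimal sketch of how such a UDF could be written with the `h3` python API (an assumption for illustration purposes, not necessarily the exact implementation used in this solution) is shown here.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import h3

# illustrative UDF: index a latitude / longitude point into its H3 hexadecimal string at a given resolution
@F.udf(StringType())
def to_h3(lat, lng, resolution):
  return h3.geo_to_h3(lat, lng, resolution)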
%md
In the example below, we can easily extract transactions happening outside of any customer's preferred locations. Please note that we previously relaxed our conditions by adding 3 extra layers of H3 polygons to capture transactions happening in the close vicinity of spending clusters.
from pyspark.sql import functions as F

anomalous_transactions = (
  spark
    .read
    .table('geospatial.transactions')
    # index each transaction against its H3 tile at resolution 10
    .withColumn('h3', to_h3(F.col('latitude'), F.col('longitude'), F.lit(10)))
    # attach each transaction to a known spending cluster, if any
    .join(tiles, ['user', 'h3'], 'left_outer')
    # only keep transactions that do not fall within any known cluster
    .filter(F.expr('cluster IS NULL'))
    .drop('h3', 'cluster', 'tf_idf')
)
display(anomalous_transactions)
Showing all 81 rows.
%md
Out of half a million transactions, we extracted 81 records in less than 5 seconds. Not necessarily fraudulent, maybe not even suspicious, these transactions did not match any of our users' "normal" behaviours, and as such are worth flagging as part of an overarching fraud prevention framework. In a real-life example, we should also factor in time and additional transactional context. Would the same transaction happening on a Sunday afternoon or a Wednesday morning be equally suspicious given the user characteristics we could learn?
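%md
As a minimal illustration of that idea (assuming the transactions table carries an event time column, named `timestamp` here purely for the sake of the example), simple temporal features could be attached to each anomalous record before any downstream scoring.
from pyspark.sql import functions as F

# hypothetical enrichment: attach day of week and hour of day to each anomalous transaction
# (the `timestamp` column is an assumption and may be named differently in practice)
anomalous_with_time = (
  anomalous_transactions
    .withColumn('day_of_week', F.date_format(F.col('timestamp'), 'E'))
    .withColumn('hour_of_day', F.hour(F.col('timestamp')))
)
display(anomalous_with_time)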
%md
Before moving forward, it is always beneficial to validate our strategy (although not empirically) using a simple visualization for a given customer (`99407ef8-40ae-424e-b9ae-9fd2e4838ec3`), reporting card transactions happening outside of any known pattern.
import folium
from folium import plugins
from pyspark.sql import functions as F

user = '99407ef8-40ae-424e-b9ae-9fd2e4838ec3'

# retrieve anomalous transactions and known spending clusters (GeoJSON) for our user
anomalies = anomalous_transactions.filter(F.col('user') == user).toPandas()
clusters = model_personalized.filter(F.col('user') == user).toPandas().cluster.iloc[0]

# overlay each anomalous transaction on top of the user's personalized clusters
personalized = folium.Map([40.75466940037548, -73.98365020751953], zoom_start=12, width='80%', height='100%')
folium.TileLayer('Stamen Toner').add_to(personalized)
for i, point in anomalies.iterrows():
  folium.Marker([point.latitude, point.longitude], popup=point.amount).add_to(personalized)
folium.GeoJson(clusters, name="geojson").add_to(personalized)
personalized
%md
Although this synthetic data does not show evidence of suspicious transactions, we demonstrated how anomalous records can easily be extracted from a massive dataset without the need to run complex geospatial queries. In fact, the same can now be achieved using standard SQL functionalities in a notebook or in a SQL analytics workspace.
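%md
As an illustration of the point above, a possible SQL equivalent of the anti-join is sketched below, assuming the `to_h3` UDF has been registered for SQL access; this is a sketch rather than the exact query shipped with the solution.
# register our helper so it can be referenced from SQL (the registration itself is illustrative)
spark.udf.register('to_h3', to_h3)

anomalous_sql = spark.sql('''
  SELECT t.*
  FROM geospatial.transactions t
  LEFT ANTI JOIN geospatial.tiles x
    ON t.user = x.user
    AND to_h3(t.latitude, t.longitude, 10) = x.h3
''')

display(anomalous_sql)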
%md
## Real time fraud detection
With millions of transactions and low latency requirements, it would not be realistic to join datasets in real time. Although we could load all clusters (their H3 tiles) in memory, we may have evaluated multiple models at different times of the day, for different users, for different segments or for different transaction indicators (e.g. for different brand categories or [MCC codes](https://en.wikipedia.org/wiki/Merchant_category_code)), resulting in a complex system that requires efficient lookup strategies against multiple variables.
%md
### Bloom filters
Here come [Bloom filters](https://en.wikipedia.org/wiki/Bloom_filter), efficient probabilistic data structures that can test the existence of a given record without keeping the entire set in memory. Although bloom filters have been around for a long time, their usage has not - sadly - been democratized beyond complex engineering techniques such as database optimization engines and daunting execution planners (the Delta engine leverages bloom filter optimizations among other techniques). This technique is a powerful tool worth having in a modern data science toolkit.
%md
#### The theory
The concept behind a bloom filter is to convert a series of records (in our case H3 locations) into a series of hash values, overlaying each of their byte array representations as vectors of 1s and 0s. Testing the existence of a given record comes down to testing the existence of each of its bits set to 1. Given a record `w`, if any of its bits is not found in our set, we are 100% sure we haven't seen record `w` before. However, if all of its bits are found in our set, it could be caused by an unfortunate succession of hash collisions. In other words, Bloom filters offer a false negative rate of 0 but a non-zero false positive rate (records we wrongly assume have been seen) that can be controlled by the length of our array and the number of hash functions.
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Bloom_filter.svg/720px-Bloom_filter.svg.png">
[Source](https://en.wikipedia.org/wiki/Bloom_filter)
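%md
As a quick reminder of standard Bloom filter theory (independent of any particular library), a filter of \\(m\\) bits using \\(k\\) hash functions that has ingested \\(n\\) records yields an expected false positive rate of approximately \\(p \approx (1 - e^{-kn/m})^k\\), minimized when \\(k \approx (m/n)\ln(2)\\). This is how a requested capacity and error rate translate into the size of the underlying bit array and the number of hashes.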
%md
#### The practice
We will be using the `pybloomfilter` python library to validate this approach, training a Bloom filter against each and every known H3 tile of a given user. Although our filter may logically contain millions of records, we only need to physically maintain a single byte array in memory to enable a probabilistic search (controlled here with a 1% false positive rate).
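%md
The construction of the filter itself is not shown in this extract; a minimal sketch of how the `cluster` filter referenced below could be built with `pybloomfilter` (an assumption, using the 1% false positive rate mentioned above) might look as follows.
from pybloomfilter import BloomFilter
from pyspark.sql import functions as F

# collect every known H3 tile of our user and train a Bloom filter against them (1% false positive rate)
known_tiles = tiles.filter(F.col('user') == user).select(F.col('h3')).toPandas()['h3']
cluster = BloomFilter(capacity=len(known_tiles), error_rate=0.01)
cluster.update(known_tiles)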
# every known tile should be matched by our filter (bloom filters guarantee no false negatives)
normal_df = tiles.filter(F.col('user') == user).select(F.col('h3')).toPandas()
normal_df['matched'] = normal_df['h3'].apply(lambda x: x in cluster)
display(normal_df)
Showing all 84 rows.
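%md
Conversely (a sketch reusing the `anomalies` Pandas dataframe computed above), the H3 tiles of transactions flagged as anomalous for this user should - up to the 1% false positive rate - not be matched by the filter.
import h3

# re-compute the H3 tile of each anomalous transaction and test it against the Bloom filter
anomalies['h3'] = anomalies.apply(lambda x: h3.geo_to_h3(x.latitude, x.longitude, 10), axis=1)
anomalies['matched'] = anomalies['h3'].apply(lambda x: x in cluster)
display(anomalies)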