02_geofraud_fraud(Python)

Geospatial fraud detection

A large-scale fraud prevention system is usually a complex ecosystem made of various controls (all with critical SLAs), a mix of traditional rules and AI, and a patchwork of technologies spanning proprietary on-premises systems and open source cloud technologies. In a previous solution accelerator, we addressed the problem of blending rules with AI in a common orchestration layer powered by MLflow. In this series of notebooks centered around geospatial analytics, we demonstrate how the Lakehouse enables organizations to better understand customer behaviours, no longer based on who customers are but on how they bank, no longer using a one-size-fits-all rule but a truly personalized AI. After all, identifying abnormal patterns is only possible if one first understands what normal behaviour is, and doing so for millions of customers is a challenge that requires both data and AI combined into one platform. As part of this solution, we are releasing a new open source geospatial library, GEOSCAN, to detect geospatial behaviours at massive scale, track customer patterns over time and detect anomalous card transactions.


  • STAGE0: Home page
  • STAGE1: Using a novel approach to geospatial clustering with H3
  • STAGE2: Detecting anomalous transactions as ML enriched rules

antoine.amend@databricks.com

Context

In the previous notebook, we demonstrated how GEOSCAN can help financial services institutions leverage their entire dataset to better understand customer-specific behaviours. In this notebook, we want to use the insights gained earlier to extract anomalous events and bridge the technological gap that exists between analytics and operational environments. More often than not, fraud detection frameworks run outside of an analytics environment due to the combination of data sensitivity (PII), regulatory requirements (PCI DSS) and model materiality (high SLAs and low latency). For these reasons, we explore here multiple strategies to serve our insights either as a self-contained framework or through an online datastore (such as Redis, MongoDB or ElastiCache, although many other solutions may be viable).

%pip install pybloomfiltermmap3==0.5.3 h3==3.7.1 folium==0.12.1 mlflow

Attaching transactional context to geo clusters

As we have trained personalized models for each customer, we can easily understand the type of transactions as well as the days and hours at which these transactions usually take place. Are these clusters more "active" during working hours or on weekends? Are these transactions more about fast food and coffee shops, or are they driven by fewer but more expensive items? Such a geospatial analytics framework combined with transaction enrichment (a future solution accelerator) could tell us a great deal about our customers' spending beyond demographics, moving towards a customer-centric approach to retail banking. Unfortunately, our synthetic dataset does not contain any additional attributes to learn behavioural patterns from. For the purpose of this exercise, we will retrieve our clusters (tiled with H3 polygons as introduced earlier) as-is to detect transactions that happened outside of any known location.

tiles = spark.read.table('geospatial.tiles')
display(tiles)
 
user                                  cluster  h3               tf_idf
4c420442-ad21-42c2-9083-fc49eff08658  2        8A2A100F249FFFF  10.564714693938615
4c420442-ad21-42c2-9083-fc49eff08658  1        8A2A100D2497FFF  6.657084036813761
20fa691d-0f57-47b8-b02d-9b11bf670fcd  3        8A2A100D6607FFF  40.81628650625178
20fa691d-0f57-47b8-b02d-9b11bf670fcd  5        8A2A1072C2A7FFF  6.093287563185636
20fa691d-0f57-47b8-b02d-9b11bf670fcd  4        8A2A10725227FFF  13.190127126657282
20fa691d-0f57-47b8-b02d-9b11bf670fcd  5        8A2A1072D99FFFF  6.804088960799614
ad4afa8e-08a6-4b21-a970-3108625ded21  4        8A2A1008D027FFF  52.28290125699032
ad4afa8e-08a6-4b21-a970-3108625ded21  3        8A2A10725767FFF  3.5380001166358364
ad4afa8e-08a6-4b21-a970-3108625ded21  2        8A2A1072CAF7FFF  6.657084036813761
8c57cc6e-7a6b-436b-8bd8-6ebadebaea56  5        8A2A100D614FFFF  34.96272692162268
8c57cc6e-7a6b-436b-8bd8-6ebadebaea56  6        8A2A100D140FFFF  21.303530497186628
8c57cc6e-7a6b-436b-8bd8-6ebadebaea56  1        8A2A100AA2D7FFF  10.163373303014312
8c57cc6e-7a6b-436b-8bd8-6ebadebaea56  3        8A2A1008DA8FFFF  14.861559839991813
8c57cc6e-7a6b-436b-8bd8-6ebadebaea56  2        8A2A10721847FFF  3.352447539260435
475235f4-6977-4555-a24d-bee02ab40d1e  1        8A2A100D346FFFF  41.82853169270477
475235f4-6977-4555-a24d-bee02ab40d1e  0        8A2A1072191FFFF  3.376938559268731
77806952-071b-4340-9637-3cf0ab2f1ba1  5        8A2A100D3C1FFFF  10.155712086250277
77806952-071b-4340-9637-3cf0ab2f1ba1  2        8A2A1072DC67FFF  6.855593953004444

Showing the first 1000 rows.

As the core of our framework relies on open data standards (GeoJSON, RFC 7946), we can load our models as a simple DataFrame without relying on the GEOSCAN library. We simply have to read the data directory of our model output.

model_personalized = spark.read.format('parquet').load('/FileStore/antoine.amend@databricks.com/models/geoscan_personalized/data')
display(model_personalized)
 
user
cluster
9f1db052-34d7-4765-b761-a1ca695aac47
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.9889453112901,40.71062048474754],[-73.98399230057818,40.71118596398378],[-73.97876611505194,40.71370009900722],[-73.98053410075359,40.71985467576307],[-73.98250746186373,40.7212036976212],[-73.99465371169123,40.72481231719942],[-73.99587899858518,40.724886953846095],[-73.99727022757318,40.72474131491665],[-73.99774576161406,40.72240904068028],[-73.99791389377285,40.72003124240348],[-73.99503974011245,40.714661885919554],[-73.99205628788214,40.71204622632105],[-73.9889453112901,40.71062048474754]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-73.97712979405824,40.73105798073428],[-73.97478301866964,40.731337019293726],[-73.97461321800313,40.73258228778922],[-73.97482869274954,40.76165736595704],[-73.97699083174996,40.76332596214534],[-73.98014282870288,40.7657233870877],[-73.981...
7a55538d-007f-494e-a3b9-95c2b2a52207
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.97558811191603,40.72958916983983],[-73.97391912042526,40.731576068588275],[-73.97546177413234,40.7353105899514],[-73.97988674062832,40.736809625833374],[-73.98146126062024,40.73687501265456],[-73.9827354619863,40.73484458837315],[-73.97819701874542,40.73043582548788],[-73.97558811191603,40.72958916983983]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-74.0024648303684,40.74019025401797],[-73.99891752403038,40.74104192916136],[-73.99540831543378,40.74286358897046],[-73.99470465343823,40.746065876915324],[-73.99486737920653,40.74687077512607],[-73.99779757570279,40.75693535945431],[-73.99840103948401,40.757836331285915],[-73.9995239514307,40.758615286592786],[-74.00344302155172,40.758292983955634],[-74.00558109142915,40.756022419420006],[-74.00789223745001,40.7526140157744],[-74....
84557835-849a-4665-ac1c-35941f24a978
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-74.0016210823734,40.73983553620303],[-74.00153383710918,40.73983790711519],[-74.0004124212418,40.74019207572536],[-73.99563281272046,40.74301935536932],[-73.99399904039822,40.747002223628655],[-73.99499197101218,40.75668768062775],[-74.00222223333186,40.75945937672014],[-74.00562686466535,40.756075135432724],[-74.00625365958336,40.75535659392547],[-74.0067058899014,40.754642804705476],[-74.00896853798945,40.748915592586975],[-74.00826280847323,40.74213659579032],[-74.00679845371259,40.74045002629673],[-74.0016210823734,40.73983553620303]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-73.97546933737678,40.74334866485118],[-73.97434346589743,40.74359480261331],[-73.9713483285376,40.74518608546917],[-73.97085869111524,40.747195506383136],[-73.97033237822903,40.7516339535133],[-73.970...
b47c815d-5bf3-4004-9139-514438b4d746
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.97406045532378,40.731841975497936],[-73.97406045532378,40.731841975497936],[-73.97762112668144,40.733580075523825],[-73.97406045532378,40.731841975497936],[-73.97406045532378,40.731841975497936]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-74.00253076376109,40.73964896187779],[-74.00068693375252,40.74050831935639],[-73.9972330856821,40.74373127024382],[-73.99506650694094,40.75636189523518],[-73.99668847841514,40.75761298298672],[-74.00096084687833,40.757389105796015],[-74.00556707161998,40.75456586650321],[-74.00801847079951,40.750290382793885],[-74.00517065004757,40.74124963434176],[-74.00253076376109,40.73964896187779]]]}},{"type":"Feature","id":2,"properties":{"name":"CLUSTER-2"},"geometry":{"type":"Polygon","coordinates":[[[-73.9948954899003,40.71099847273855],[-73.982912...
1aef7e13-94ed-45e3-a2c3-4648475f20c0
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.97100826069082,40.77239162484282],[-73.97013556035047,40.77354847215227],[-73.9700650164464,40.773982137672945],[-73.97048095841636,40.774564642225734],[-73.97052670905587,40.774617383936715],[-73.97070552020513,40.77472051997588],[-73.97223133902966,40.77462551371856],[-73.97244331789089,40.774457895930254],[-73.97248487418378,40.774402806261996],[-73.97310400917878,40.77346863446995],[-73.97261332734251,40.773211981863945],[-73.97100826069082,40.77239162484282]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-73.92861231473367,40.77098741603371],[-73.92761039025156,40.77106795541138],[-73.92578041913468,40.77235771791788],[-73.92544346630376,40.77269045470097],[-73.92494415885317,40.77335130344419],[-73.92527965835662,40.77528530544546],[-73.92599449476653,40.77569812438272],[-...
f721cb72-59f0-4191-85f3-1f935923180f
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.9756790520487,40.72856183433147],[-73.9756790520487,40.72856183433147],[-73.97467899885422,40.73090828173853],[-73.97467899885422,40.73090828173853],[-73.97543609566566,40.73353110159939],[-73.9763383398128,40.735394851987245],[-73.9769200886928,40.735756786519936],[-73.97914310259448,40.73677574745124],[-73.98134936237958,40.737363557176984],[-73.9815695701963,40.73741155807137],[-73.98421708483819,40.73253886969194],[-73.98264626554041,40.73144850228287],[-73.97842925722827,40.72967437071384],[-73.9756790520487,40.72856183433147]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-73.9538633988938,40.7651880648227],[-73.9538633988938,40.7651880648227],[-73.95174314622673,40.76913029054398],[-73.95174314622673,40.76913029054398],[-73.95225425182211,40.77219281205282],[-73.953102327...
46686449-8c2d-4673-b031-b69bdc3393ef
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.9749360176266,40.72966068827283],[-73.9749360176266,40.72966068827283],[-73.97422652175689,40.731621729186564],[-73.97422652175689,40.731621729186564],[-73.97493835571194,40.735324696239786],[-73.97610183811331,40.736048577183084],[-73.98035662680512,40.737660094961136],[-73.98348683676866,40.73396116284122],[-73.98442461282693,40.73226354022465],[-73.98344398314356,40.73175060775841],[-73.97927263229263,40.730029205683074],[-73.97847497335648,40.72972707903481],[-73.9749360176266,40.72966068827283]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-73.9417485282421,40.81511984776281],[-73.9399175849255,40.81527655196356],[-73.93932199694085,40.81685833392453],[-73.93939662160822,40.817666332572934],[-73.93963327055886,40.81928003053878],[-73.94012349542163,40.82180504599478],[-73....
753632ae-39f5-497c-89f2-7c6bcc061357
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.99168123257103,40.71027670398007],[-73.98359782946446,40.71114270077486],[-73.98194961250631,40.71140297175685],[-73.97817201188285,40.71301503504666],[-73.98011910767735,40.72040523686259],[-73.98056357212438,40.72060897740519],[-73.98737164056158,40.72279813964301],[-73.99602468627877,40.725260577776204],[-73.99756385014275,40.723330929570615],[-73.9988284995296,40.72108513580366],[-73.99920075459629,40.719456927529414],[-73.99897105799336,40.7180608444488],[-73.9963891919332,40.71457132869792],[-73.99168123257103,40.71027670398007]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-74.00261800869141,40.739646590213276],[-73.9964968255309,40.742780111736025],[-73.99394387635343,40.745600918727995],[-73.9960638054679,40.756172973451186],[-73.99744097408029,40.757862380158656],[-74...
99407ef8-40ae-424e-b9ae-9fd2e4838ec3
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.97816757927455,40.72968143094231],[-73.97816757927455,40.72968143094231],[-73.97491642839876,40.73252017322482],[-73.97491642839876,40.73252017322482],[-73.9787966712127,40.73349441797063],[-73.97816757927455,40.72968143094231]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-73.99555482807504,40.710009844041544],[-73.98220697564007,40.71128815736682],[-73.9788448999848,40.71348224836952],[-73.98380731022796,40.714265012423816],[-73.99406452317075,40.71538946941151],[-73.99679445969086,40.71267270953618],[-73.99674024554369,40.712404525582734],[-73.99555482807504,40.710009844041544]]]}},{"type":"Feature","id":2,"properties":{"name":"CLUSTER-2"},"geometry":{"type":"Polygon","coordinates":[[[-74.0036075572548,40.74037498708773],[-73.99801630830987,40.741444049615446],[-73.997347860...
3117685b-93f5-421b-b464-ffbeaae9cd28
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.97200443786842,40.744089395572644],[-73.96297747350792,40.75458335393403],[-73.96282379217797,40.755127057580914],[-73.96305240737179,40.755390714819534],[-73.96427444631676,40.75649105247827],[-73.97085856369056,40.75955175967991],[-73.97391805119359,40.76071055494608],[-73.97487817024465,40.76068470138022],[-73.97477596502384,40.75469783367492],[-73.97455942503916,40.75362450287598],[-73.97200443786842,40.744089395572644]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-73.99545915124374,40.70979672352426],[-73.98460264278297,40.71116945789058],[-73.97978243158465,40.711785094509864],[-73.97939274222577,40.7129820811397],[-73.97989477260785,40.72024948745662],[-73.98122395860175,40.720752948837806],[-73.98737164056158,40.72279813964301],[-73.99693835197137,40.725181840780536],[...
5cd64172-0be5-4198-897a-348d191a6c53
{"type":"FeatureCollection","features":[{"type":"Feature","id":0,"properties":{"name":"CLUSTER-0"},"geometry":{"type":"Polygon","coordinates":[[[-73.97951863360315,40.73185661869977],[-73.97654491248615,40.732853880866585],[-73.9745264444005,40.73371744913029],[-73.98008648285618,40.73745159784299],[-73.98039814152267,40.737605027042086],[-73.98277209826354,40.73354891640344],[-73.98285089815695,40.73333100911732],[-73.98290439674061,40.7324664561511],[-73.97951863360315,40.73185661869977]]]}},{"type":"Feature","id":1,"properties":{"name":"CLUSTER-1"},"geometry":{"type":"Polygon","coordinates":[[[-73.98952262623986,40.71087449497353],[-73.98330741363019,40.71152806035676],[-73.97977097653165,40.71372689982927],[-73.98222411699118,40.719539346474114],[-73.98894148078257,40.722755643990375],[-73.99516428251576,40.72447483123515],[-73.99838175952343,40.723039027391295],[-74.0006184401271,40.72109043847359],[-74.00094599616232,40.720542165520726],[-73.99616189037958,40.71101803135044],[-73...

Showing all 200 rows.

Extracting anomalies

Our (simplistic) approach is to detect whether a transaction was executed in a popular area for each of our customers. Since we have stored and indexed all of our models as H3 tiles, it becomes easy to enrich each transaction with its cluster using a simple JOIN operation (for large-scale processing) or a lookup (for real-time scoring) instead of complex geospatial queries like point-in-polygon search. Although we are using the H3 Python API instead of the GEOSCAN library, our generated H3 hexadecimal values are consistent, assuming we select the same resolution we used to generate those tiles (10). For reference, please have a look at the H3 resolution table.

import h3
from pyspark.sql.functions import udf
 
@udf("string")
def to_h3(lat, lng, precision):
  # Index the point at the given H3 resolution; upper-case to match the stored tiles
  h = h3.geo_to_h3(lat, lng, precision)
  return h.upper()

In the example below, we extract transactions happening outside of any of a customer's preferred locations. Please note that we previously relaxed our conditions by adding 3 extra layers of H3 polygons to capture transactions happening in the close vicinity of spending clusters.

from pyspark.sql import functions as F
 
anomalous_transactions = (
  spark
    .read
    .table('geospatial.transactions')
    .withColumn('h3', to_h3(F.col('latitude'), F.col('longitude'), F.lit(10)))
    .join(tiles, ['user', 'h3'], 'left_outer')
    .filter(F.expr('cluster IS NULL'))
    .drop('h3', 'cluster', 'tf_idf')
)
 
display(anomalous_transactions)
 
user                                  latitude            longitude           amount
24379a45-4122-4aec-acbd-b2fd28be6e1d  40.75789141965946   -73.99835952741506  140.89
07e9962a-ab5a-4bd3-9fd0-5c84dd9f546d  40.75296015493235   -73.99515091580336  43.79
07e9962a-ab5a-4bd3-9fd0-5c84dd9f546d  40.71992404451979   -73.9859494230664   42.86
07e9962a-ab5a-4bd3-9fd0-5c84dd9f546d  40.73536095189145   -73.97559472880282  41.07
07e9962a-ab5a-4bd3-9fd0-5c84dd9f546d  40.731617526390565  -73.98038251228536  99.57
dbcabce0-1492-49ab-bf39-a657341c57de  40.73511173179816   -74.00868281199843  197.52
dbcabce0-1492-49ab-bf39-a657341c57de  40.733468062443606  -74.00564845810317  92.58
dbcabce0-1492-49ab-bf39-a657341c57de  40.734369585425085  -74.01018163461728  5.89
dbcabce0-1492-49ab-bf39-a657341c57de  40.7319138338179    -73.99733475436797  108.66
009e640a-cce9-432e-9390-f4c45c0e2cd1  40.729354817132815  -73.97627751555675  170.12
009e640a-cce9-432e-9390-f4c45c0e2cd1  40.730941729860355  -73.97943924472561  77.27
009e640a-cce9-432e-9390-f4c45c0e2cd1  40.730681955895854  -73.97707146867972  170.01
009e640a-cce9-432e-9390-f4c45c0e2cd1  40.73571388014588   -73.98250748305402  49.93
009e640a-cce9-432e-9390-f4c45c0e2cd1  40.73249323478542   -73.98390966539424  157.63
009e640a-cce9-432e-9390-f4c45c0e2cd1  40.73521691856492   -73.9749341570664   178.99
e25c5ce2-0546-4f2d-a1bb-94e133b7bc1c  40.75093671229667   -74.00607850584055  196.3
e25c5ce2-0546-4f2d-a1bb-94e133b7bc1c  40.748898521472434  -74.00761383513255  190.58
e25c5ce2-0546-4f2d-a1bb-94e133b7bc1c  40.73524565701462   -73.97986931351151  66.8

Showing all 81 rows.

Out of half a million transactions, we extracted 81 records in less than 5 seconds. Not necessarily fraudulent, maybe not even suspicious, these transactions did not match any of our users' "normal" behaviours and, as such, are worth flagging as part of an overarching fraud prevention framework. In a real-life example, we should factor in time and additional transactional context. Would the same transaction happening on a Sunday afternoon or a Wednesday morning be suspicious given the user characteristics we could learn?

Before moving forward, it is always beneficial to validate our strategy (although not empirically) using a simple visualization for a given customer (99407ef8-40ae-424e-b9ae-9fd2e4838ec3), reporting card transactions happening outside of any known patterns.

import folium
from folium import plugins
from pyspark.sql import functions as F
 
user = '99407ef8-40ae-424e-b9ae-9fd2e4838ec3'
anomalies = anomalous_transactions.filter(F.col('user') == user).toPandas()
clusters = model_personalized.filter(F.col('user') == user).toPandas().cluster.iloc[0]
 
personalized = folium.Map([40.75466940037548,-73.98365020751953], zoom_start=12, width='80%', height='100%')
folium.TileLayer('Stamen Toner').add_to(personalized)
 
for i, point in anomalies.iterrows():
  folium.Marker([point.latitude, point.longitude], popup=point.amount).add_to(personalized)
 
folium.GeoJson(clusters, name="geojson").add_to(personalized)
personalized

Although this synthetic data does not show evidence of suspicious transactions, we demonstrated how anomalous records can easily be extracted from a massive dataset without the need to run complex geospatial queries. In fact, the same can now be achieved using standard SQL functionalities in a notebook or in a SQL analytics workspace.

Real time fraud detection

With millions of transactions and low latency requirements, it would not be realistic to join datasets in real time. Although we could load all clusters (their H3 tiles) in memory, we may have evaluated multiple models at different times of the day, for different users, for different segments or for different transaction indicators (e.g. for different brand categories or MCC codes), resulting in a complex system that requires efficient lookup strategies against multiple variables.
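
As a point of comparison, the exact in-memory lookup mentioned above could be sketched as follows. This is only an illustration in plain Python: the composite key, the `context` labels ("weekday", "weekend") and the helper names `register` and `is_known` are hypothetical and not part of GEOSCAN or this notebook. Every tile of every (user, context) pair is held in memory, which is the cost a probabilistic structure aims to reduce.

```python
from collections import defaultdict

# Hypothetical composite key: (user, context) -> set of known H3 tiles.
# "context" stands in for any extra variable (time of day, segment, MCC).
known_tiles = defaultdict(set)


def register(user, context, h3_tile):
    """Record a tile as part of a user's normal behaviour in a given context."""
    known_tiles[(user, context)].add(h3_tile)


def is_known(user, context, h3_tile):
    """Exact membership test; unseen (user, context) pairs are never known."""
    return h3_tile in known_tiles.get((user, context), set())


example_user = "4c420442-ad21-42c2-9083-fc49eff08658"
register(example_user, "weekday", "8A2A100F249FFFF")
register(example_user, "weekend", "8A2A100D2497FFF")

print(is_known(example_user, "weekday", "8A2A100F249FFFF"))
print(is_known(example_user, "weekday", "8A2A100D2497FFF"))
```

The first lookup matches a registered tile, while the second one fails because that tile was only registered for the "weekend" context.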

Bloom filters

Enter Bloom filters, an efficient probabilistic data structure that can test for the existence of a given record without keeping the entire set in memory. Although Bloom filters have been around for a long time, their usage has sadly not been democratized beyond complex engineering techniques such as database optimization engines and daunting execution planners (the Delta engine leverages Bloom filter optimizations, among other techniques). This technique is a powerful tool worth having in a modern data science toolkit.

The theory

The concept behind a Bloom filter is to convert each record (in our case an H3 location) into a series of hash values and to overlay their bit representations onto a single shared vector of 1s and 0s. Testing the existence of a given record amounts to testing that each of the bits it sets to 1 is present in that vector. Given a record w, if any of its bits is not set in the vector, we are 100% sure we have not seen record w before. However, if all of its bits are set, this could be caused by an unfortunate succession of hash collisions. In other words, Bloom filters offer a false negative rate of 0 but a non-zero false positive rate (records we wrongly assume have been seen) that can be controlled by the length of the array and the number of hash functions.
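
To make the mechanics concrete, here is a toy Bloom filter written with the Python standard library only. It is a sketch for illustration, not the implementation used by pybloomfilter: the class name `SimpleBloomFilter`, the choice of salted SHA-256 as a family of hash functions and the sizes passed to the constructor are all assumptions.

```python
import hashlib


class SimpleBloomFilter:
    """Toy Bloom filter for illustration; not the pybloomfilter implementation."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits          # length of the bit array
        self.num_hashes = num_hashes      # number of hash functions
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, record):
        # Simulate several hash functions by salting SHA-256 with an index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{record}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, record):
        # Set every bit the record hashes to
        for pos in self._positions(record):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, record):
        # A single unset bit proves the record was never added
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(record))


known = ["8A2A100F262FFFF", "8A2A1072180FFFF", "8A2A1072D507FFF"]
bloom = SimpleBloomFilter(num_bits=1024, num_hashes=7)
for tile in known:
    bloom.add(tile)

# Every inserted tile is found again: no false negatives by construction
print(all(tile in bloom for tile in known))
```

A tile that was never added is reported as absent unless all of its positions happen to collide with bits set by other records, which is the false positive case described above.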

The practice

We will be using the pybloomfilter Python library to validate this approach, training a Bloom filter against each and every known H3 tile of a given user. Although our filter may logically contain millions of records, we only need to physically maintain one byte array in memory to enable a probabilistic search (controlled here with a 1% false positive rate).

import pybloomfilter
 
def train_bloom(records):
  # Size the filter for the number of records, targeting a 1% false positive rate
  cluster = pybloomfilter.BloomFilter(len(records), 0.01)
  cluster.update(records)
  return cluster
  
records = list(tiles.filter(F.col('user') == user).toPandas().h3)
cluster = train_bloom(records)
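
To get a feel for how small such a filter can be, the sketch below applies the textbook sizing formulas for a Bloom filter with n records and a target false positive rate p: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The helper names are hypothetical, the formulas assume ideal hash functions, and the figures are not necessarily what pybloomfilter allocates internally. The capacity of 84 matches the number of tiles displayed for this customer.

```python
import math


def bloom_parameters(capacity, error_rate):
    """Textbook bit-array size m and hash count k (assumes ideal hash functions)."""
    num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / capacity * math.log(2)))
    return num_bits, num_hashes


def expected_false_positive_rate(capacity, num_bits, num_hashes):
    """Usual approximation p = (1 - e^(-k n / m))^k once the filter is full."""
    return (1 - math.exp(-num_hashes * capacity / num_bits)) ** num_hashes


m, k = bloom_parameters(capacity=84, error_rate=0.01)
print(m, k, expected_false_positive_rate(84, m, k))
```

Under these assumptions, a 1% error rate costs a little under 10 bits per record regardless of how long each record is, which is why a single small byte array can stand in for the full set of tiles.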

We retrieve all the points we know exist and test our false negative rate (should be zero).

normal_df = tiles.filter(F.col('user') == user).select(F.col('h3')).toPandas()
normal_df['matched'] = normal_df['h3'].apply(lambda x: x in cluster)
display(normal_df)
 
h3               matched
8A2A100F262FFFF  true
8A2A1072180FFFF  true
8A2A1072D507FFF  true
8A2A100F2667FFF  true
8A2A107252A7FFF  true
8A2A10721977FFF  true
8A2A100F2647FFF  true
8A2A100F22AFFFF  true
8A2A100F2317FFF  true
8A2A1072CAF7FFF  true
8A2A1072DDAFFFF  true
8A2A10721B2FFFF  true
8A2A10725057FFF  true
8A2A107252D7FFF  true
8A2A100F231FFFF  true
8A2A100F2657FFF  true
8A2A100D3C17FFF  true
8A2A100F276FFFF  true

Showing all 84 rows.

Similarly, we access our anomalous transactions to validate our false positive rate (should be lower than 1%).