Keeping Identity Graphs In Sync With Apache Spark

The online advertising industry is based on identifying users with cookies, and showing relevant ads to interested users. But there are many data providers, many places to target ads and many people browsing online. How can we identify users across data providers? The first step in solving this is by cookie mapping: a chain of server calls that pass identifiers across providers. Sadly, chains break, servers break, providers can be flaky or use caching and you may never see the whole of the chain. The solution to this problem is constructing an identity graph with the data we see: in our case, cookie ids are nodes, edges are relations and connected components of the graph are users.

In this talk I will explain how Hybrid Theory leverages Spark and GraphFrames to construct and maintain a 2000 million node identity graph with minimal computational cost.

About Ruben Berenguel

Ruben Berenguel is the lead data engineer at Hybrid Theory, as well as an occasional contributor for Spark (especially PySpark). PhD in Mathematics, he moved to data engineering where he works mostly with Scala, Python and Go designing and implementing big data pipelines in London and Barcelona.