Building Identity Graphs over Heterogeneous Data

In today’s world, customers and service providers (e.g., social networks, ad targeting, retail) interact through a variety of modes and channels, such as browsers, apps, and devices. In each interaction, the user is identified by a token, possibly a different token for each mode or channel. Examples of such identity tokens include cookies and app IDs. As the user engages more with these services, linkages are generated between tokens belonging to the same user; each linkage connects multiple identity tokens together. A challenging problem is to unify the identities of a user into a single connected component, providing a unified identity view. This capability needs to extend beyond channels and create true unification of identity.

Since every interaction or transaction event contains some form of identity, a highly scalable platform is required to identify and link the identities belonging to a user as a connected component. We therefore built the Identity Graph platform on the Spark processing engine, using a distributed version of the union-find algorithm with path compression.
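To make the core idea concrete, here is a minimal single-machine sketch of union-find with path compression applied to identity linkages. It is not the distributed Spark implementation described in the talk; the `UnionFind` class, its method names, and the sample tokens are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set over hashable identity tokens (illustrative sketch only)."""

    def __init__(self):
        self.parent = {}

    def find(self, token):
        # Register unseen tokens as their own root.
        self.parent.setdefault(token, token)
        # First pass: walk up to the root of this token's component.
        root = token
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every visited token at the root.
        while self.parent[token] != root:
            self.parent[token], token = root, self.parent[token]
        return root

    def union(self, a, b):
        # Merge the components containing tokens a and b.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def components(self):
        # Group all known tokens by their root.
        groups = defaultdict(set)
        for token in list(self.parent):
            groups[self.find(token)].add(token)
        return list(groups.values())


# Hypothetical linkages: each pair says two tokens were seen for the same user.
linkages = [
    ("cookie:abc", "app:123"),
    ("app:123", "device:xyz"),
    ("cookie:def", "app:456"),
]

uf = UnionFind()
for a, b in linkages:
    uf.union(a, b)
```

After processing these linkages, `cookie:abc`, `app:123`, and `device:xyz` share one root (one user), while `cookie:def` and `app:456` form a second component. New linkages can be applied incrementally by calling `union` again, without rebuilding the structure.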

We would like to present the following:

  • The journey of building a highly scalable Identity Graph platform that handles 25+ billion vertices and 30+ billion edges, with 200M new linkages added incrementally every day.
  • Why we chose to build our own Graph processing framework using Spark instead of other distributed graph databases.
  • How we handle Data Quality challenges.
  • Optimization strategies implemented to overcome scalability and performance challenges faced while building and traversing the Graph.
  • A peek into the online version of the Identity Graph, which enables real-time graph building, querying, and traversals.


  • The feasibility of building a highly scalable Graph framework using Spark.
  • The idea of building and leveraging Graph in real-time to achieve freshness.

About Sudha Viswanathan

Walmart Labs

Sudha is a lead Big Data Engineer at Walmart Labs, pioneering in the area of building scalable and reliable data platforms. She has a solid background in the full life cycle of data and systems that enable data-driven decision making. Currently, she is working on the Customer Identity Graph platform, which uses Spark as the processing engine and handles 20+ billion nodes, enabling Walmart to identify its customers regardless of the channel that brings them to Walmart. Previously, she worked at JP Morgan Chase, where she built and productionized machine learning pipelines using Spark.

About Saigopal Thota

Walmart Labs

Saigopal Thota is a Principal Data Scientist leading Customer Identity at Walmart Labs. His areas of work include graph optimization algorithms, ML algorithms for data quality, and scalable real-time and batch systems. Saigopal has a PhD in Computer Science from the University of California, Davis.