Sudha Viswanathan - Databricks

Sudha Viswanathan

Senior Software Engineer, Walmart Labs

Sudha is a lead Big Data Engineer at Walmart Labs, pioneering scalable and reliable data platforms. She has a solid background in the full life cycle of data and systems that enable data-driven decision making. Currently, she is working on the Customer Identity Graph platform, which uses Spark as its processing engine and handles 20+ billion nodes, enabling Walmart to identify its customers irrespective of the channel that brings them to Walmart. Previously, she worked at JP Morgan Chase, where she built and productionized machine learning pipelines using Spark.


Walmart Customer Identity Graph – Powered by Apache Spark

Summit 2020

Overview: Walmart has multiple subsidiaries, and each of them generates its own unique customer ID. Our goal is to identify our customers across channels and provide a 360-degree view of each customer. For example, when a customer shops in a Walmart store and later logs in online, Walmart must be able to identify that customer as an existing store customer and recommend products online based on his or her store transactions. We need the same capability not only between stores and the online world but also across channels: an active customer of one channel should not be treated as a new customer when he or she logs in to another. Every Walmart interaction and transaction record contains some form of customer identity (such as cookies, email IDs, Walmart IDs, 3P IDs, etc.). When such information is embedded within streaming events, we need a platform that identifies and links the identities belonging to the same customer. Hence we built the Customer Identity Graph platform using the Spark processing engine on HDFS, with a union-find algorithm with path compression at the back end.
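To make the linking idea concrete, here is a minimal single-machine sketch of union-find with path compression. It is not the platform's distributed Spark implementation, which the abstract does not describe. The `UnionFind` class, the `same_customer` helper, the union-by-size heuristic, and the sample identifiers are all assumptions made for illustration.

```python
class UnionFind:
    """Disjoint-set forest with path compression (illustrative sketch)."""

    def __init__(self):
        self.parent = {}  # identifier -> parent identifier
        self.size = {}    # root identifier -> number of members

    def find(self, x):
        # Register unseen identifiers as their own singleton set.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # First pass: walk up to the root of the set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: path compression, point every visited node at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a, b):
        # Merge the sets containing a and b; attach the smaller under the larger.
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return root_a


# Hypothetical linkages: each pair says two identities were seen together.
linkages = [
    ("cookie:abc", "email:jane@example.com"),
    ("email:jane@example.com", "store_id:1001"),
    ("cookie:xyz", "email:raj@example.com"),
]

uf = UnionFind()
for left, right in linkages:
    uf.union(left, right)


def same_customer(a, b):
    # Two identities belong to the same customer if they share a root.
    return uf.find(a) == uf.find(b)
```

In this sketch, `same_customer("cookie:abc", "store_id:1001")` holds because the two identifiers are connected through the shared email address, while `cookie:abc` and `cookie:xyz` stay in separate components.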

I would like to present the following:

  1. The journey of building the Customer Identity Graph platform, which handles 20+ billion vertices and 30+ billion edges, with 200M new linkages added incrementally every day.
  2. Why we chose to build our own Graph processing framework using Spark instead of using GraphX or other distributed graph databases.
  3. How we handle Data Quality challenges.
  4. Optimization strategies implemented to overcome scalability and performance challenges faced while building and traversing the Graph.
  5. How the online servable Identity Graph enables high throughput with low latency in real-time streaming.
  • The feasibility of building your own Graph framework using Spark.
  • The idea of leveraging Graph in real-time to achieve high throughput.