Sudha is a lead Big Data Engineer at Walmart Labs, pioneering the development of scalable and reliable data platforms. She has a solid background in the full life cycle of data and systems that enable data-driven decision making. Currently, she is working on the Customer Identity Graph platform, which uses Spark as its processing engine and handles 20+ billion nodes, enabling Walmart to identify its customers irrespective of the channel that brings them to Walmart. Previously, she worked at JP Morgan Chase, where she built and productionized machine learning pipelines using Spark.
Overview: Walmart has multiple subsidiaries, each of which generates its own unique customer ID. Our goal is to identify our customers across channels and provide a 360-degree view of each customer. For example, when a customer shops in a Walmart store and later logs in to walmart.com, Walmart must be able to identify that customer as an existing store customer and recommend products online based on his/her store transactions. We need the same capability not only between stores and the online world but also across channels: an active customer of jet.com should not be treated as a new customer when he/she logs in to walmart.com. Every Walmart interaction and transaction record contains some form of customer identity (such as cookies, email IDs, Walmart IDs, and 3P IDs). When such information is embedded within streaming events, we need a platform that identifies and links the identities belonging to the same customer. Hence we built the Customer Identity Graph platform using the Spark processing engine on HDFS, with the union-find algorithm with path compression at the back end.
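To make the linking idea concrete, here is a minimal single-machine sketch of union-find with path compression in Python. It is not the platform's distributed Spark implementation, and it makes no claim about how that implementation is structured. The class `UnionFind`, the helper `link_identities`, and the sample identifiers are all hypothetical and chosen only for illustration; the sketch assumes each event yields a pair of identities observed together.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure over arbitrary hashable identity keys."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen identities as their own root.
        self.parent.setdefault(x, x)
        # First pass: walk up to the root of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every node on the
        # path directly at the root so later lookups are shorter.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        # Merge the sets containing a and b (no-op if already merged).
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def link_identities(pairs):
    """Group identities that are connected through co-occurrence pairs."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    clusters = defaultdict(set)
    for identity in list(uf.parent):
        clusters[uf.find(identity)].add(identity)
    return dict(clusters)


# Hypothetical identity pairs, each extracted from a single event.
pairs = [
    ("cookie:abc", "email:a@example.com"),
    ("email:a@example.com", "walmart_id:42"),
    ("jet_id:7", "email:b@example.com"),
]
clusters = link_identities(pairs)
```

In this sketch, `cookie:abc` and `walmart_id:42` never appear in the same event, yet they end up in the same cluster because both were seen with `email:a@example.com`. The `jet_id:7` identity stays in a separate cluster until some event connects it to the first one.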
I would like to present the following: