At Salesforce, Kexin is responsible for the research and design of the core distributed data processing and machine learning architecture for Marketing Cloud Einstein and Salesforce DMP. Kexin leads the data science engineers in implementing these designs, and drives continuous improvement across related operational aspects, including performance, fault tolerance, scaling, automation, and cost. Before Salesforce, Kexin worked for Krux, BigCommerce, NICTA, Brandscreen, Freelancer, and Microsoft Research, building software systems for large-scale machine learning, data mining, real-time bidding, intelligent marketing, anti-fraud, and anti-money laundering. Kexin also holds a Ph.D. in computer science.
Krux, a Salesforce company, is a Data Management Platform (DMP) that helps its clients collect, manage, analyze, and activate their people data. With a wide range of premium clients such as Kellogg, L’Oréal, Warner Brothers, The New York Times, The Washington Post, Uber, Spotify, and many other household names, Krux sees over 3.5 billion unique users globally each month, across site, media, mobile app, transactional, and offline traffic sources. That is more than Facebook, Wikipedia, and Twitter combined. Processing data at this volume and velocity has presented many challenges over the seven years Krux has existed, and the team had to develop various proprietary strategies and technologies to overcome them.

In this session, Salesforce will share how Apache Spark, in particular, helped transform the DMP’s data processing infrastructure, using the evolution of their "Look-alike" algorithm as an example. Look-alike, a similarity-based classifier, is one of the algorithms most commonly used by marketers and publishers looking to extend their audience reach. Get a high-level introduction to the use case and algorithm, and learn about Salesforce's experience moving the implementation from Hadoop to Spark and how doing so increased the performance, reliability, and serviceability of the product. You will also hear about some of the technical challenges they faced, including large-scale joins over skewed data, and how they solved them in Spark. Learn how Spark provides a wide range of high-level and low-level APIs that prove useful when implementing customized machine learning algorithms, as compared with Hadoop, and how its overall abstraction makes it easy to develop modular, maintainable code that is also performant. Session hashtag: #SFeco3
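One widely used remedy for the skewed-join problem mentioned above is key salting: split each hot key into several synthetic sub-keys on the large side, and replicate the small side once per sub-key so every match is still found. The abstract does not describe Krux's actual implementation, so the sketch below is a generic, plain-Python illustration of the technique; the function names and the bucket count are assumptions, not Krux's code:

```python
import random
from collections import defaultdict

SALT_BUCKETS = 4  # assumed number of salt partitions; tune to the observed skew

def salt_large_side(records, buckets=SALT_BUCKETS):
    """Append a random salt to each key on the large, skewed side,
    spreading a hot key's rows across `buckets` sub-keys (partitions)."""
    return [((key, random.randrange(buckets)), value) for key, value in records]

def replicate_small_side(records, buckets=SALT_BUCKETS):
    """Replicate each small-side row once per salt bucket so every
    salted key on the large side still finds its match."""
    return [((key, b), value) for key, value in records for b in range(buckets)]

def join(left, right):
    """Simple hash join on the (key, salt) pairs; stands in for the
    shuffle join a distributed engine like Spark would perform."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key[0], (lv, rv)) for key, lv in left for rv in index.get(key, [])]

# Toy example: user events (skewed toward one hot user) joined
# with a small table of segment memberships.
events = [("user_hot", i) for i in range(8)] + [("user_cold", 99)]
segments = [("user_hot", "segment_A"), ("user_cold", "segment_B")]

joined = join(salt_large_side(events), replicate_small_side(segments))
```

Without salting, every row for `user_hot` lands on one partition and one task does almost all the work; with salting, the hot key's rows spread across `SALT_BUCKETS` partitions at the cost of duplicating the small side.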