Prasad Chalasani is the SVP of Data Science at MediaMath, leading the development of innovative, proprietary, scalable algorithms and analytics that leverage massive amounts of data to power smarter digital marketing for the world’s leading advertisers. Prior to joining MediaMath, Prasad led Data Science at Yahoo Research, and before that worked for 10 years as a quantitative researcher and portfolio manager of statistical trading strategies at hedge funds and at Goldman Sachs. Prasad holds a PhD in Computer Science from CMU and a BTech in Computer Science from IIT.
Most traditional applications of Spark involve massive data sets that already exist. A less common but nevertheless extremely useful use case is simulation, where massive amounts of data are generated from model parameters. In this talk we explore some of the challenges that arise in setting up scalable simulations in a specific application, and share some of our solutions and lessons learned along the way, in the realms of both mathematics and programming. The application scenario we explore is quantifying the impact of cookie contamination in randomized experiments aimed at measuring digital advertisement lift (effectiveness). Cookies are randomly assigned to test or control, and those in test are exposed to ads while those in control are not. The goal is to measure the lift in conversion rate due to ad exposure. One important factor that taints such measurements is cookie contamination: a real-world user may have multiple cookies (but the system is unaware of this linkage), and if their cookies fall in both the test and control groups, then the cookie in control may show a higher conversion rate than a clean control cookie that has no "siblings" in the test group. Analytically quantifying the impact of this contamination is difficult without making overly simplistic assumptions, so one idea we pursued is to simulate the impact of cookie contamination, with millions of trials over tens of millions of users. The goals are: (a) to understand and quantify the impact of cookie distribution and contamination on the expected value of the computed lift as well as on its 90% confidence interval, and (b) to derive approximate analytical formulas for the observed lift. Scaling up the simulations to a large number of trials and users is challenging, and we share some of our solutions and also describe the analysis of error and expectation.
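To make the contamination effect concrete, here is a minimal single-machine sketch in plain Python. It is not the speakers' Spark implementation, and every modeling choice in it is an assumption made for illustration: each user has either one or two cookies (`p_multi` is the probability of two), each cookie is assigned to test or control with probability 1/2, a user counts as exposed if any of their cookies is in test, and each cookie converts independently at the user's rate. The function and parameter names (`observed_lift`, `run_trials`, `p_multi`, `base_rate`, `true_lift`) are hypothetical.

```python
import random
import statistics


def observed_lift(n_users, p_multi, base_rate, true_lift, rng):
    """Run one trial and return the lift measured at the cookie level."""
    conv = {"test": 0, "control": 0}   # conversions per group
    count = {"test": 0, "control": 0}  # cookies per group
    for _ in range(n_users):
        # Assumption: a user has two cookies with probability p_multi, else one.
        n_cookies = 2 if rng.random() < p_multi else 1
        groups = [rng.choice(("test", "control")) for _ in range(n_cookies)]
        # Assumption: the user sees ads if any of their cookies is in test,
        # so a control cookie with a test "sibling" is contaminated.
        exposed = "test" in groups
        rate = base_rate * (1 + true_lift) if exposed else base_rate
        for g in groups:
            count[g] += 1
            conv[g] += rng.random() < rate  # True counts as 1
    test_rate = conv["test"] / count["test"]
    control_rate = conv["control"] / count["control"]
    return test_rate / control_rate - 1


def run_trials(n_trials, seed=0, **params):
    """Return the mean observed lift and an empirical 90% interval."""
    rng = random.Random(seed)
    lifts = [observed_lift(rng=rng, **params) for _ in range(n_trials)]
    cuts = statistics.quantiles(lifts, n=20)  # cut points at 5%, 10%, ..., 95%
    return statistics.mean(lifts), (cuts[0], cuts[-1])


if __name__ == "__main__":
    for p_multi in (0.0, 0.5, 1.0):
        mean, (lo, hi) = run_trials(
            20, seed=1, n_users=5000, p_multi=p_multi,
            base_rate=0.2, true_lift=0.5,
        )
        print(f"p_multi={p_multi}: mean lift {mean:.3f}, 90% interval ({lo:.3f}, {hi:.3f})")
```

Under these assumptions the observed lift shrinks as more users carry multiple cookies, because contaminated control cookies convert at the exposed rate. Trials are independent of one another, which is what makes this kind of simulation a natural candidate for distribution across a cluster.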
MediaMath is a leading ad-tech platform that responds to over 200 billion ad opportunities daily and leverages massive amounts of data to power smarter digital marketing. The company uses Spark heavily, both in production and in R&D, to develop innovative, proprietary, and scalable solutions to multiple problems: (a) machine-learning models for predicting conversion probability given an ad impression; (b) measuring the causal effectiveness of advertising in randomized tests; (c) running simulations to understand the impact of cookie refreshes and other phenomena on ad effectiveness metrics; (d) finding deviceIDs belonging to the same user based on possibly noisy external deterministic information. In this presentation Prasad will describe these problems briefly, and dive deeper into how MediaMath extensively uses Spark and the Databricks platform to evaluate multiple machine learning models and model-update and calibration schemes, and to visualize results.