R&D to Product Pipeline Using Apache Spark in AdTech

The central premise of DataXu is to apply data science to better marketing. At its core is the Real Time Bidding Platform, which processes 2 petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 continents. Running on top of this platform is Dataxu’s analytics engine, which gives clients insightful reports that address their marketing business questions. Both platforms share some common requirements: real-time processing, scalable machine learning, and ad-hoc analytics. This talk will showcase DataXu’s successful use cases of the Apache Spark framework and Databricks, which address all of the above challenges while preserving the agility and rapid prototyping needed to take a product from the initial R&D phase to full production. The team will share their best practices and highlight the steps from large-scale Spark ETL processing and model testing all the way through to interactive analytics.

About Saket Mengle

Saket holds a PhD in text mining from Illinois Institute of Technology, Chicago. He has worked in a variety of fields including text classification, information retrieval, large-scale machine learning and linear optimization. He currently works as Senior Principal Data Scientist at Dataxu Inc., where he is responsible for developing and maintaining the algorithms that drive Dataxu’s real-time advertising platform.

About Maximo Gurmendez

Maximo holds a Master’s degree in Computer Science / Artificial Intelligence from Northeastern University, where he attended as a Fulbright Scholar. Since 2009 he has been working with DataXu as a lead engineer, tackling the challenge of machine learning over large data sets. He’s also the founder of MDATALABS (a data science & engineering consultancy) and a professor at the School of Engineering, University of Montevideo, where he is conducting student projects involving the use of Apache Spark for large-scale data science.