Maximo holds a Master's degree in Computer Science / Artificial Intelligence from Northeastern University, which he attended as a Fulbright Scholar. Since 2009 he has worked at DataXu as a lead engineer, tackling the challenge of machine learning over large data sets. He is also the founder of MDATALABS (a data science and engineering consultancy) and a professor at the School of Engineering, University of Montevideo, where he supervises student projects that use Apache Spark for large-scale data science.
This presentation describes DataXu's experience of moving from a machine learning system developed entirely in-house on Hadoop to a hybrid system that leverages Spark's ML Pipeline tool to automate and improve the data science engineering behind the classifiers used for real-time bidding.
DataXu was founded by MIT aeronautics and astronautics scientists who wrote the combinatorial language that guided NASA's Mars mission plans. These scientists, joined by co-founders with extensive digital media and consumer electronics expertise, examined potential commercial applications that would benefit from a system able to make real-time decisions. We bid on behalf of advertisers, using machine learning and optimization techniques to find the opportunities and prices that maximize return on investment. DataXu currently processes 2 petabytes of data per day and responds to ad auctions at a rate of 1.6 million requests per second across 5 continents.
In this presentation we will describe how we use Spark as a flexible framework that lets the production system operate efficiently while allowing continuous data science experimentation. In particular, we will share:
1) How we're migrating from a Hadoop-based system that trains multiple models in one pass using custom code to a multi-pass process that leverages in-memory processing and Spark's ML pipelining.
2) How we're using smart partitioning and caching to continuously train a fixed number of models per batch in an incremental fashion (as opposed to our previous big-bang approach).
3) How a custom job-flow specification allowed us to achieve reliable production training while also supporting multi-language scientific experimentation and ongoing improvement of models using data hooks and A/B testing.
4) How we were able to successfully use Spark's trained classifiers in a time-critical, high-throughput, multi-threaded setting.
The central premise of DataXu is to apply data science to better marketing. At its core is the Real Time Bidding Platform, which processes 2 petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 continents. Running on top of this platform is DataXu's analytics engine, which gives clients analytics reports that address their marketing business questions. Both platforms share some common requirements: real-time processing, scalable machine learning, and ad-hoc analytics. This talk will showcase how DataXu has successfully used the Apache Spark framework and Databricks to address all of these challenges while keeping the agility and rapid prototyping needed to take a product from initial R&D to full production. The team will share their best practices and highlight each step, from large-scale Spark ETL processing and model testing all the way through to interactive analytics.