Beth holds a PhD in speech recognition from the University of Cambridge. She has worked in a variety of fields including speech recognition, music indexing, computational biology, medical informatics, and activity monitoring. She currently leads and manages the Optimization Team, which is responsible for developing and maintaining the algorithms that drive DataXu’s real-time advertising platform.
This presentation describes DataXu's experience of moving from a fully in-house machine learning system built on Hadoop to a hybrid system that leverages Spark's ML Pipeline tool to automate and improve the data science engineering behind the classifiers used for real-time bidding.

DataXu was founded by MIT aeronautics and astronautics scientists who wrote the combinatorial language that guided NASA's Mars mission plans. These scientists, joined by co-founders with extensive digital media and consumer electronics expertise, examined potential commercial applications that would benefit from a system that could make real-time decisions. We bid on behalf of advertisers, using machine learning and optimization techniques to find the opportunities and prices that maximize return on investment. Currently, DataXu processes 2 petabytes of data per day and responds to ad auctions at a rate of 1.6 million requests per second across five continents.

In this presentation we will describe how we use Spark as a flexible framework that lets the production system operate efficiently while still allowing continuous data science experimentation. In particular, we will share:

1) How we're migrating from a Hadoop-based system that trains multiple models in one pass using custom code to a multi-pass process that leverages in-memory processing and Spark's ML pipelining.
2) How we're using smart partitioning and caching to continuously train a fixed number of models per batch in an incremental fashion (as opposed to our previous big-bang approach).
3) How a custom job-flow specification allowed us to achieve reliable production training while also supporting multi-language scientific experimentation and ongoing improvement of models using data hooks and A/B testing.
4) How we were able to successfully use Spark's trained classifiers in a time-critical, high-throughput, multi-threaded setting.
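
As a rough illustration of the incremental idea in point 2, the sketch below cycles through a list of models and hands back a fixed number of them per training round, rather than retraining everything at once. It is not DataXu's implementation: the function name `rolling_batches`, the model ids, and the batch size are all hypothetical, and the actual training step (for example, fitting a Spark ML pipeline on cached data) is left out.

```python
from itertools import cycle, islice
from typing import Iterator, List, Sequence


def rolling_batches(model_ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches of model ids forever, wrapping around the list.

    Hypothetical helper: each yielded batch is the set of models to retrain
    in one round, so every model is refreshed in turn instead of all at once.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not model_ids:
        raise ValueError("model_ids must not be empty")
    ids = cycle(model_ids)
    while True:
        # islice consumes exactly batch_size ids from the endless cycle.
        yield list(islice(ids, batch_size))


# Hypothetical model ids; each round retrains only two of the five models.
schedule = rolling_batches(["m1", "m2", "m3", "m4", "m5"], batch_size=2)
first_four = [next(schedule) for _ in range(4)]
print(first_four)
```

With this kind of schedule, the cost of each round depends on the batch size rather than on the total number of models, which is the contrast with a big-bang retrain that the abstract draws.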