Maximo holds a master’s degree in computer science/AI from Northeastern University, which he attended as a Fulbright Scholar. As Chief Engineer of Montevideo Labs, he leads data science engineering projects for complex systems in large US companies. He is an expert in big data technologies and co-author of the popular book ‘Mastering Machine Learning on AWS.’ Additionally, Maximo is a computer science professor at the University of Montevideo and director of its data science for business program.
Notebooks are a great tool for Big Data. They have drastically changed the way scientists and engineers develop and share ideas. However, most world-class Spark products cannot be easily engineered, tested and deployed just by modifying or combining notebooks. Taking a prototype to production with high quality typically requires proper software engineering. The code we develop on such larger-scale projects must be modular, robust, readable, testable, reusable and performant. At Montevideo Labs we have many years of experience helping our clients architect large Spark systems capable of processing data at petabyte scale. In previous Spark Summits, we described how we productionalized an unattended machine learning system in Spark that trains thousands of ML models daily and deploys them for real-time serving at extremely low latency. In this session, we will share lessons learned taking other Spark products to production at top US tech companies.
Throughout the session we will address the following questions along with the relevant best practices: How do you make your Spark code readable, debuggable, reusable and testable? How do you architect Spark components for different processing schemes, such as batch ETL, low-latency services and model serving? How do you package and deploy Spark applications to the cloud? In particular, we will do a deep dive into how to take advantage of Spark's laziness (and DAG generation) to structure our code around sound software engineering practices without worrying about an efficiency penalty. Because Spark only builds an execution plan until an action is invoked, we do not have to let efficiency concerns alone dictate how the code is organized; we can leverage this 'laziness' to follow established software patterns and principles and write elegant, testable and highly maintainable code. Moreover, we can encapsulate Spark-specific code in classes and utilities and keep our business rules cleaner. We will support this presentation with live demos that illustrate the concepts introduced.
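The idea of structuring code around laziness can be sketched without Spark at all. The example below is only an analogy, not Spark code and not the speakers' implementation: it uses plain Python generators as a stand-in for lazy transformations, and every name in it (`drop_invalid`, `add_tax`, `pipeline`, `source`) is hypothetical. Each business rule is a small function that can be unit-tested on its own, and nothing is read from the source until the final "action" (`list(...)`) runs. In Spark, the analogous pattern would be small functions that take a DataFrame and return a DataFrame.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]
Transform = Callable[[Iterable[Row]], Iterator[Row]]


def drop_invalid(rows: Iterable[Row]) -> Iterator[Row]:
    """Hypothetical business rule: keep only rows with a positive amount."""
    return (r for r in rows if r["amount"] > 0)


def add_tax(rate: float) -> Transform:
    """Hypothetical business rule: add a 'total' column that includes tax."""
    def step(rows: Iterable[Row]) -> Iterator[Row]:
        return ({**r, "total": round(r["amount"] * (1 + rate), 2)} for r in rows)
    return step


def pipeline(*steps: Transform) -> Transform:
    """Compose small steps into one lazy plan; composing does no work."""
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        for step in steps:
            rows = step(rows)
        return iter(rows)
    return run


# Record every row the source hands out, so laziness is observable.
reads: List[int] = []


def source() -> Iterator[Row]:
    data = [
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": -5.0},
        {"id": 3, "amount": 20.0},
    ]
    for row in data:
        reads.append(row["id"])
        yield row


# Building the plan reads nothing from the source.
plan = pipeline(drop_invalid, add_tax(0.1))(source())
reads_before_action = list(reads)

# The "action": only now do rows flow through every step.
result = list(plan)
```

Because the steps are composed rather than fused by hand, each rule can be tested with a three-row list, and the order or number of steps can change without touching the rules themselves.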
dataxu bids on ads in real time on behalf of its customers at a rate of 3 million requests a second and trains on past bids to optimize future bids. Our system trains thousands of advertiser-specific models and processes multi-terabyte datasets. In this presentation we will share the lessons learned from our transition towards a fully automated Spark-based machine learning system and how this has drastically reduced the time to get a research idea into production. We'll also share how we:
- continually ship models to production
- train models in an unattended fashion with auto-tuning capabilities
- tune and overbook cluster resources for maximum performance
- ported our previous ML solution into Spark
- evaluate the performance of high-rate bidding models
The presentation will describe DataXu's experience of moving from a fully in-house machine learning system based on Hadoop to a hybrid system that leverages Spark's ML Pipeline tool to automate and improve the data science engineering behind the creation of classifiers used for real-time bidding. DataXu was founded by MIT aeronautics and astronautics scientists who wrote the combinatorial language that guided NASA's Mars mission plans. These scientists, joined by co-founders with extensive digital media and consumer electronics expertise, examined potential commercial applications that would benefit from a system that could make real-time decisions. We bid on behalf of advertisers using machine learning and optimization techniques to find the opportunities and prices that maximize the return on investment. Currently DataXu processes 2 petabytes of data per day and responds to ad auctions at a rate of 1.6 million requests per second across 5 different continents. In this presentation we will describe how we use Spark as a flexible framework that allows the production system to operate efficiently while allowing continuous data science experimentation. In particular, we will share:
1) How we're migrating from a Hadoop-based system that trains multiple models in one pass using custom code to a multi-pass process that leverages in-memory processing and Spark's ML pipelining.
2) How we're using smart partitioning and caching to continuously train a fixed number of models per batch in an incremental fashion (as opposed to our previous big-bang approach).
3) How a custom job-flow specification allowed us to achieve reliable production training while also supporting multi-language scientific experimentation and ongoing improvement of models using data hooks and A/B testing.
4) How we were able to successfully use Spark's trained classifiers in a time-critical, high-throughput and multi-threaded setting.
The central premise of DataXu is to apply data science to better marketing. At its core is the real-time bidding platform, which processes 2 petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 different continents. Built on top of this platform is DataXu’s analytics engine, which gives clients insightful reports that address their marketing questions. Both platforms share some common requirements: real-time processing, scalable machine learning, and ad-hoc analytics. This talk will showcase how DataXu has successfully used the Apache Spark framework and Databricks to address all of the above challenges while maintaining the agility and rapid prototyping needed to take a product from the initial R&D phase to full production. The team will share their best practices and highlight the steps involved, from large-scale Spark ETL processing and model testing all the way through to interactive analytics.