Javier holds a degree in Computer Science from ORT University (Montevideo) and has worked at Montevideo Labs since 2015 as a Senior Data Engineer on large-scale big data projects. He has helped top tech companies architect their Spark applications, leading many successful projects from design and implementation through deployment. He is also an advocate of clean code as a central paradigm for development.
Notebooks are a great tool for Big Data. They have drastically changed the way scientists and engineers develop and share ideas. However, most world-class Spark products cannot be easily engineered, tested and deployed just by modifying or combining notebooks. Taking a prototype to production with high quality typically requires proper software engineering. The code we develop on such larger-scale projects must be modular, robust, readable, testable, reusable and performant. At Montevideo Labs we have many years of experience helping our clients architect large Spark systems capable of processing data at petabyte scale. In previous Spark Summits, we described how we productionized an unattended Machine Learning system in Spark that trains thousands of ML models daily, which are then deployed for real-time serving at extremely low latency. This time, we will share lessons learned from taking other Spark products to production at top US tech companies.
Throughout the session we will address the following questions along with the relevant best practices: How do you make your Spark code readable, debuggable, reusable and testable? How do you architect Spark components for different processing schemes, such as batch ETL, low-latency services and model serving? How do you package and deploy Spark applications to the cloud? In particular, we will do a deep dive into how to take advantage of Spark's laziness (and DAG generation) to structure our code around software engineering best practices rather than around efficiency concerns alone. Because Spark builds an execution plan first and only runs it when an action is triggered, splitting the logic into small, well-named steps does not by itself force extra passes over the data. We can therefore leverage this 'laziness' to follow the best software patterns and principles and write elegant, testable and highly maintainable code. Moreover, we can encapsulate Spark-specific code in classes and utilities and keep our business rules cleaner. We will support the presentation with live demos that illustrate the concepts introduced.
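The idea of structuring code around lazy evaluation can be sketched without a cluster. The following is a minimal, pure-Python illustration and not Spark code: `LazyDataset`, `only_valid`, `with_fee`, `load_events` and the sample rows are all hypothetical names invented for this sketch. The stand-in class only records transformations and runs them when an action (`collect`) is called, which loosely mirrors how Spark defers work until an action. Spark's own DataFrame offers a `transform` method that allows a similar chaining style, but nothing below depends on it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]
Step = Callable[[Iterator[Row]], Iterator[Row]]


@dataclass(frozen=True)
class LazyDataset:
    """Hypothetical toy stand-in for a lazily evaluated dataset (not a Spark API)."""

    source: Callable[[], Iterable[Row]]
    plan: Tuple[Step, ...] = ()

    def transform(self, step: Step) -> "LazyDataset":
        # Nothing is executed here: the step is only appended to the plan.
        return LazyDataset(self.source, self.plan + (step,))

    def collect(self) -> List[Row]:
        # The "action": read the source once and stream rows through every step.
        rows: Iterator[Row] = iter(self.source())
        for step in self.plan:
            rows = step(rows)
        return list(rows)


# Business rules live in small, named functions that know nothing about I/O.
def only_valid(rows: Iterator[Row]) -> Iterator[Row]:
    return (row for row in rows if row["valid"])


def with_fee(rows: Iterator[Row]) -> Iterator[Row]:
    # Assumed rule for illustration: a 2% fee, computed in integer cents.
    return ({**row, "fee_cents": row["amount_cents"] * 2 // 100} for row in rows)


reads: List[str] = []  # records every time the source is actually read


def load_events() -> List[Row]:
    reads.append("read")
    return [
        {"user": "a", "amount_cents": 1000, "valid": True},
        {"user": "b", "amount_cents": 500, "valid": False},
        {"user": "c", "amount_cents": 250, "valid": True},
    ]


# Composing the pipeline is free: the source has not been touched yet.
pipeline = LazyDataset(load_events).transform(only_valid).transform(with_fee)
reads_before_action = list(reads)

# Only the action triggers a single pass over the data.
result = pipeline.collect()

# Each rule can also be unit-tested on its own with a few in-memory rows.
fee_check = list(with_fee(iter([{"amount_cents": 500}])))
```

The design choice this sketch is meant to highlight is that readability and testability come from the small named steps, while the cost of execution is decided later, when the whole plan is run at once.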
dataxu bids on ads in real time on behalf of its customers at a rate of 3 million requests per second and trains on past bids to optimize future bids. Our system trains thousands of advertiser-specific models and runs on multi-terabyte datasets. In this presentation we will share the lessons learned from our transition towards a fully automated Spark-based machine learning system and how this has drastically reduced the time needed to get a research idea into production. We'll also share how we:
- continually ship models to production
- train models in an unattended fashion with auto-tuning capabilities
- tune and overbook cluster resources for maximum performance
- ported our previous ML solution into Spark
- evaluate the performance of high-rate bidding models