Nick is a Principal Engineer at IBM. He’s a member of the Apache Spark PMC and author of Machine Learning with Spark. Previously, he co-founded Graphflow, a startup focused on recommendations and customer intelligence. He has worked at Goldman Sachs and Cognitive Match, and led the Data Science team at Mxit, Africa’s largest social network. He’s passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.
In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. This talk explores recent advances in this area in both research and practice. I will explain how deep learning can be applied to recommendation settings; cover architectures for handling contextual data, side information, and time-based models; compare deep learning approaches to other cutting-edge contextual recommendation models; and finally explore scalability issues and model-serving challenges.
Tuning a Spark ML model with cross-validation can be an extremely computationally expensive process. As the number of hyperparameter combinations increases, so does the number of models being evaluated. The default configuration in Spark is to evaluate each of these models one-by-one to select the best-performing one. When running this process with a large number of models, if the training and evaluation of a single model does not fully utilize the available cluster resources, that waste is compounded for each model and leads to long run times. Enabling model parallelism in Spark cross-validation, available from Spark 2.3, allows more than one model to be trained and evaluated at the same time, making better use of cluster resources. We will go over how to enable this setting in Spark, what effect it has on an example ML pipeline, and best practices to keep in mind when using this feature. Additionally, we will discuss ongoing work to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This can be combined with model parallelism to further reduce the run time of cross-validation for complex machine learning pipelines.
The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering, and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases. This is particularly challenging in the case of deploying Apache Spark ML pipelines for low-latency scoring. While MLlib's DataFrame API is powerful and elegant, it is relatively ill-suited to the needs of many real-time predictive applications, in part because it is tightly coupled with the Spark SQL runtime. In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open, and standardized deployment of data science pipelines and analytic applications. I will also introduce and evaluate Aardpfark, a library for exporting Spark ML pipelines to PFA, and compare and contrast it with other available alternatives, including PMML, MLeap, ONNX, and Apple's CoreML.
The talk will cover how Graphflow uses Spark to power its real-time recommendation and customer intelligence platform. We will describe how we use Spark and MLlib to process and analyze customer behavior data for recommendation and predictive analytics models. We will also give an overview of using Spark and Shark to power data aggregation and analytics for customer insights and front-end data visualization apps.