Nick Pentreath - Databricks

Nick Pentreath

Principal Engineer, IBM

Nick is a Principal Engineer at IBM. He’s a member of the Apache Spark PMC and author of Machine Learning with Spark. Previously, he co-founded Graphflow, a startup focused on recommendations and customer intelligence. He has worked at Goldman Sachs, Cognitive Match, and led the Data Science team at Mxit, Africa’s largest social network. He’s passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.


Recurrent Neural Networks for Recommendations and Personalization - Summit Europe 2018

In the last few years, RNNs have achieved significant success in modeling time series and sequence data, in particular within the speech, language, and text domains. Recently, these techniques have begun to be applied to session-based recommendation tasks, with very promising results. This talk explores the latest research advances in this domain, as well as practical applications. I will provide an overview of RNNs, covering common architectures and applications, before diving deeper into RNNs for session-based recommendations. I will pay particular attention to the challenges inherent in common personalization tasks and the specific adjustments to models and optimization techniques required for success.


Deep Learning for Recommender Systems - Summit 2018

In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. This talk explores recent advances in this area in both research and practice. I will explain how deep learning can be applied to recommendation settings and describe architectures for handling contextual data, side information, and time-based models. I will then compare deep learning approaches to other cutting-edge contextual recommendation models, and finally explore scalability issues and model serving challenges. Session hashtag: #AISAIS13

Model Parallelism in Spark ML Cross-Validation - Summit 2018

Tuning a Spark ML model with cross-validation can be an extremely computationally expensive process. As the number of hyperparameter combinations increases, so does the number of models being evaluated. The default configuration in Spark is to evaluate these models one by one and then select the best-performing one. When this process runs over a large number of models, any cluster resources left idle while a single model is trained and evaluated are wasted again for every model, which leads to long run times. Enabling model parallelism in Spark cross-validation, available from Spark 2.3, allows more than one model to be trained and evaluated at the same time and makes better use of cluster resources. We will go over how to enable this setting in Spark, what effect it has on an example ML pipeline, and best practices to keep in mind when using this feature. Additionally, we will discuss ongoing work to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This can be combined with model parallelism to further reduce the run time of cross-validation for complex machine learning pipelines. Session hashtag: #DS6SAIS
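The idea behind model parallelism can be sketched in plain Python. This is an illustration of the concept only, not Spark code and not the speakers' implementation: the toy one-parameter model and the names `k_fold_splits`, `fit_and_score`, and `cross_validate` are hypothetical. The `parallelism` argument here is meant to mirror the role of the `parallelism` parameter on Spark's `CrossValidator`, under the assumption that each parameter combination can be fitted and scored independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def k_fold_splits(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_rows))
    for fold in range(num_folds):
        valid = indices[fold::num_folds]
        valid_set = set(valid)
        train = [i for i in indices if i not in valid_set]
        yield train, valid


def fit_and_score(params, xs, ys, train, valid):
    """Fit a toy regularized slope (hypothetical model); return validation MSE."""
    sxy = sum(xs[i] * ys[i] for i in train)
    sxx = sum(xs[i] * xs[i] for i in train)
    slope = sxy / (sxx + params["reg"])
    return mean((ys[i] - slope * xs[i]) ** 2 for i in valid)


def cross_validate(param_grid, xs, ys, num_folds=3, parallelism=1):
    """Score every parameter combination, up to `parallelism` at a time."""
    folds = list(k_fold_splits(len(xs), num_folds))

    def average_metric(params):
        return mean(fit_and_score(params, xs, ys, tr, va) for tr, va in folds)

    # parallelism=1 evaluates the combinations one by one;
    # larger values evaluate several combinations concurrently.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        metrics = list(pool.map(average_metric, param_grid))

    best = min(range(len(param_grid)), key=metrics.__getitem__)  # lower MSE wins
    return param_grid[best], metrics


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 13)]
    ys = [2.0 * x for x in xs]
    grid = [{"reg": r} for r in (0.0, 1.0, 10.0)]
    print(cross_validate(grid, xs, ys, parallelism=3))
```

In this sketch the threads only demonstrate the scheduling pattern; on a real cluster the benefit would come from several model fits sharing otherwise idle executors, so a suitable parallelism level depends on how much of the cluster a single fit already uses.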

Productionizing Spark ML Pipelines with the Portable Format for Analytics - Summit 2018

The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases. This is particularly challenging in the case of deploying Apache Spark ML pipelines for low-latency scoring. While MLlib's DataFrame API is powerful and elegant, it is relatively ill-suited to the needs of many real-time predictive applications, in part because it is tightly coupled with the Spark SQL runtime. In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open and standardized deployment of data science pipelines & analytic applications. I'll also introduce and evaluate Aardpfark, a library for exporting Spark ML pipelines to PFA, as well as compare and contrast it to other available alternatives including PMML, MLeap, ONNX and Apple's CoreML. Session hashtag: #ML1SAIS

Using Spark and Shark to Power a Real-time Recommendation and Customer Intelligence Platform - Summit 2014

The talk will cover how Graphflow uses Spark to power its real-time recommendation and customer intelligence platform. We will cover how we use Spark and MLlib to process and analyze customer behavior data for recommendation and predictive analytics models. We will also give an overview of using Spark and Shark to power data aggregation and analytics for customer insights and front-end data visualization apps.