Learning to Rank with Apache Spark: A Case Study in Production Machine Learning - Databricks

At Elsevier we work to improve the lives of scientists and connect them with the most relevant research in their field. As part of that, we built a large-scale recommendation engine for ScienceDirect, which helps millions of users discover papers relevant to their research. This presentation covers how our data scientists and engineers designed and implemented the recommendation engine using Apache Spark. We give an overview of the recommendation engine and its two main components: item-based collaborative filtering (IBCF) and a learning to rank (LtR) algorithm that employs user feedback and feature engineering to improve how we rank recommendations.

We talk about how the IBCF algorithm is implemented within Spark and how the resulting recommendations can be significantly improved by adding LtR as a rescoring mechanism. We also cover how we train and tune the model and what makes us consider the model “production ready”. The choice of Spark has sped up development by providing a lingua franca between data scientists and engineers, helping them to work together effectively. Spark has also allowed us to build and deploy the model into production and generate billions of high-quality article recommendations every day.
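As a rough illustration of the two stages described above, the sketch below computes item-to-item cosine similarities from implicit feedback (the IBCF step) and then re-ranks the candidates with a linear scoring function (standing in for the LtR rescoring step). It is not the Elsevier implementation: it uses plain Python rather than Spark, and the function names, the toy interactions, the single "extra feature" per item, and the hand-set weights are all assumptions made for illustration. In a real LtR setup the weights would be learned from user feedback rather than fixed by hand.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(interactions):
    """Cosine similarity between items, from implicit (user, item) feedback."""
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    item_counts = defaultdict(int)  # number of users who touched each item
    co_counts = defaultdict(int)    # number of users who touched both items
    for items in items_by_user.values():
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            co_counts[(a, b)] += 1

    result = defaultdict(dict)
    for (a, b), n in co_counts.items():
        score = n / sqrt(item_counts[a] * item_counts[b])
        result[a][b] = score
        result[b][a] = score
    return result


def rescore(candidates, features, weights):
    """Re-rank candidates with a linear model over [similarity] + extra features."""
    scored = []
    for item, similarity in candidates.items():
        x = [similarity] + features.get(item, [])
        scored.append((item, sum(w * v for w, v in zip(weights, x))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): four users, three articles.
interactions = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "A"), ("u3", "C"),
    ("u4", "B"),
]
sims = item_similarities(interactions)

# IBCF alone ranks C above B as a recommendation for readers of A.
ibcf_order = sorted(sims["A"], key=sims["A"].get, reverse=True)

# Hypothetical extra feature per item (for example, a recency score) and
# hand-set weights; a trained LtR model would supply these weights instead.
extra_features = {"B": [1.0], "C": [0.1]}
ranked = rescore(sims["A"], extra_features, weights=[1.0, 0.5])
# After rescoring, B ranks above C.
```

The split mirrors the design in the abstract: the similarity step produces a candidate list cheaply, and the rescoring step can then bring in additional signals without recomputing the similarities.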

Key takeaways:
– Building a good recommendation system is a complex problem, but Spark can greatly reduce the time needed to create and evaluate machine learning models.
– Spark provides a common language for data scientists and data engineers. Having both groups communicate can speed up development and improve the end product.

Session hashtag: #SAISML12

About Adam Davidson

Adam has been a senior data engineer with Elsevier for 2 years, having worked in consulting roles for the previous 5. Adam works with Spark on a daily basis across a range of recommender systems and feels very fortunate to be working at the interface of enterprise production systems and cutting-edge machine learning techniques.

About Anna Bladzich

Anna is a senior data engineer at Elsevier. She has been a Scala developer for 4 years, working for start-ups before joining the world of research. Anna is passionate about the community and actively champions diversity in technology at Elsevier. On a daily basis Anna works on various recommendation systems utilising the latest research in data science and machine learning.