A Journey from Scikit-learn to Spark

Zalando is Europe’s leading online fashion retailer and is on its way to becoming the platform for all fashion-related business, from designers to large-scale logistics solutions. The new platform architecture challenges the company’s current in-house solutions to become more scalable, dependable and versatile. This talk describes our journey of rewriting an in-production classification system from scratch using Scala and Spark to run on AWS. Along the way, we will look at the drawbacks inherent to our old Python/Scikit-learn-based solution running on a static cluster, most prominently: hard maintenance (technical debt), data bottlenecks and too coarse-grained parallelisation. Next, we will present our new Spark-based solution and demonstrate how we mitigated these pain points by leveraging the features that Scala and Spark bring into play, in particular strong typing, data parallelisation and easy scale-out. To quantify the gain, we will provide an in-depth comparison of both solutions, based on measurements that highlight the performance gains we experienced with Spark, including learning and prediction times. The talk concludes with the lessons we learned.

About Stanimir Dragiev

Stanimir Dragiev obtained a Diploma in Computer Science from the Technische Universität (TU) Berlin in 2009, posing Grid workflow recovery as a constraint satisfaction problem. He was awarded a PhD in Machine Learning and Robotics from the Universität Stuttgart in 2014 for his thesis titled "An object representation and methods for uncertainty aware shape estimation and grasping". In autumn 2014, Stanimir joined Zalando as a data scientist and has since worked on machine learning models and infrastructure for fraud detection and prevention.