No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark

Building accurate machine learning models has long been an art practiced by data scientists: algorithm selection, hyperparameter tuning, feature selection and so on. Recently, efforts to break through this “black art” have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2TB of memory and 17TB of SSD into a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from the reliability and stability standpoints. This talk will cover the presentation already given at Spark Summit SF’17 (#SFds5), but from a more technical perspective.
Session hashtag: #EUai9

About Marcin Kulka

Marcin Kulka is a Senior Software Engineer at 9LivesData. In cooperation with machine learning researchers at NEC Labs America, he works on a Spark-based fully automated predictive modeling system. He holds master's degrees in both Computer Science and Mathematics from Warsaw University. His main areas of interest are big data, machine learning, distributed systems and algorithms. Marcin has almost 10 years of professional experience in software engineering, most of it spent working on HYDRAstor, a cutting-edge, distributed and highly scalable backup system. Privately, he is a happy husband and father of two daughters.

About Michał Kaczmarczyk

Michal Kaczmarczyk (Ph.D.) leads a development team implementing a Spark-based fully automated predictive modeling system in cooperation with NEC Laboratories America. Michal received his PhD from Warsaw University and has been exploring the field of distributed systems since 2005. He has worked for companies such as NEC Labs (Princeton, NJ), Microsoft (Redmond, WA) and 9LivesData (Warsaw, currently). During this time he has worked on core system components and published research papers at conferences such as FAST and SYSTOR. Since 2015 he has been devoted to Spark and charmed by Scala.