Apache Spark for Machine Learning with High Dimensional Labels - Databricks

This talk covers the tools we used, the hurdles we faced, and the workarounds we developed with help from Databricks support as we built a custom machine learning model to predict TV ratings for different networks and demographics.
The Apache Spark machine learning and DataFrame APIs make it incredibly easy to produce a machine learning pipeline that solves an archetypal supervised learning problem. In our applications at Cadent, we face a challenge with high dimensional labels and relatively low dimensional features. At first pass such a problem is all but intractable, but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application.

Over the course of our work we came across many tools that made our lives easier and others that forced workarounds. In this talk we will review our custom multi-stage methodology, discuss the challenges we faced, and walk through the key steps that made our project successful.

About Stefan Panayotov

In his current role as a Data Engineer at Cadent, Stefan focuses on Big Data computational platform solutions such as Spark that enable Cadent to apply data science and machine learning to achieve faster and better business results. Before Cadent, Stefan was an Application Developer at QVC, where he worked on building logistics and warehouse software solutions for the retail industry. He has also spent time as a SQL Developer at CCP, Senior Software Analyst at EXE Technologies, and IT Consultant at UNISYS. Stefan received his PhD in Computer Science from the Bulgarian Academy of Sciences, where he also served as an Assistant Professor.

About Michael Zargham

Michael has a PhD in Optimization and Decision Science from the University of Pennsylvania, with a focus on constrained resource allocation problems. Michael leads the Data Science and Engineering initiatives at Cadent, a leading provider of media, advertising technology, and data solutions for the pay-TV industry. He has also taught Convex Optimization at UPenn. He has been a practicing data-driven business architect since 2005, working on various subcontracts during his undergraduate and graduate studies.