Data Science with SparkML on Databricks is an ideal platform for applying Ensemble Learning at a massive scale. This talk takes you through the successful development of a Prediction-as-a-Service platform that trains on and predicts trends in 1 billion observed prices daily. To train an ensemble model on a multivariate time series in a space with tens of millions of dimensions, one has to fragment the whole space into subspaces whose members exhibit significant similarity. To achieve this, the vastly sparse space first undergoes dimensionality reduction into a parameter space, which is then used to cluster the observations. The data in the resulting clusters is modeled in parallel using machine learning tools capable of coefficient estimation at massive scale (SparkML and scikit-learn). The estimated model coefficients are stored in a database and used to execute predictions on demand via a web service. This approach trains the models fast enough to complete the task within a couple of hours, allowing daily or even real-time updates of the coefficients. The machine learning framework described above is used to predict airfares as a support tool for airline Revenue Management systems.
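The reduce, cluster, fit and store steps described above can be sketched in a few lines of scikit-learn. This is only an illustration under stated assumptions, not the platform's actual code: the synthetic data, the choice of `TruncatedSVD`, `KMeans` and `Ridge`, the function names `fit_cluster_models` and `predict`, and the in-memory dictionary standing in for the coefficient database are all hypothetical, and a single linear model per cluster replaces the ensemble.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge


def fit_cluster_models(features, targets, n_components=2, n_clusters=3, seed=0):
    """Reduce dimensionality, cluster, then fit one linear model per cluster."""
    # Step 1: project the sparse high-dimensional space onto a few components.
    reducer = TruncatedSVD(n_components=n_components, random_state=seed)
    reduced = reducer.fit_transform(features)

    # Step 2: group observations that are similar in the reduced space.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(reduced)

    # Step 3: estimate coefficients independently per cluster. The loop
    # iterations do not depend on each other, so they could run in parallel.
    coefficients = {}  # assumption: a dict stands in for the database
    for cluster_id in np.unique(labels):
        mask = labels == cluster_id
        model = Ridge(alpha=1.0).fit(reduced[mask], targets[mask])
        coefficients[int(cluster_id)] = (model.coef_.copy(), float(model.intercept_))
    return reduced, labels, coefficients


def predict(coefficients, cluster_id, reduced_row):
    """Serve a prediction from stored coefficients without refitting."""
    weights, intercept = coefficients[cluster_id]
    return float(np.dot(weights, reduced_row) + intercept)


# Synthetic, mostly-zero feature matrix (purely illustrative data).
rng = np.random.default_rng(0)
features = rng.random((300, 50)) * (rng.random((300, 50)) < 0.1)
targets = features @ rng.random(50) + 0.01 * rng.standard_normal(300)

reduced, labels, coefficients = fit_cluster_models(features, targets)
example = predict(coefficients, int(labels[0]), reduced[0])
```

Because prediction needs only a coefficient lookup and a dot product, it is cheap enough to serve on demand, while the per-cluster fits can be rerun on whatever schedule the data requires.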
Josef Habdank is a Lead Data Scientist and Data Platform Architect at INFARE Solutions and a passionate champion of Apache Spark. He has gained experience at big data practitioners such as Department of Defence, Thomson Reuters and Adform. He is an expert in Apache Spark and Spark-enabled technologies such as Kafka, Kinesis, MemSQL, Alluxio and others. Additionally, he is a specialist in real-time nonlinear forecasting and has experience with systems processing tens of billions of data points daily and data warehouses holding hundreds of billions of rows.