Building Large Scale Machine Learning Applications with Pipelines

Real-world machine learning applications typically consist of many components arranged in a data processing pipeline. In text classification, for example, preprocessing steps such as n-gram extraction and TF-IDF feature weighting are often necessary before training a classification model such as an SVM. We describe a framework for constructing these ML pipelines and show how it helps us build end-to-end workflows from a toolbox of off-the-shelf components that we have developed for text and image classification, together with a high-performance linear algebra library that we use for training models. We show that this framework achieves state-of-the-art results on many machine learning tasks. Our scalable implementation on Spark outperforms supercomputing installations and can match deep learning error rates on speech recognition in less than 1 hour on EC2 for $20. Finally, we discuss research in the AMPLab to support common iterative machine learning workflows through careful resource estimation and checkpoint planning.
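To make the pipeline idea concrete, the text classification example above can be sketched as a chain of small, composable stages. This is a minimal single-machine illustration in plain Python, not the framework's actual API and not a Spark implementation: the names `Transformer`, `and_then`, `fit_idf`, and `fit_linear_svm` are hypothetical, the IDF formula is one common variant, the SVM is trained with a simple subgradient method chosen for brevity, and the four training documents are toy data invented for the example.

```python
import math
from collections import Counter


class Transformer:
    """Hypothetical pipeline stage: wraps a function from one item to one item."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def and_then(self, nxt):
        # Chain two stages into a single stage that runs them in order.
        return Transformer(lambda x: nxt(self(x)))


def tokenize(text):
    return text.lower().split()


def ngrams(n):
    """Stage that emits all k-grams for k = 1..n as space-joined strings."""
    def fn(tokens):
        out = []
        for k in range(1, n + 1):
            out.extend(" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
        return out
    return Transformer(fn)


def term_frequency(grams):
    return dict(Counter(grams))


def fit_idf(docs_tf):
    """Estimator: learns IDF weights from data and returns a Transformer."""
    n = len(docs_tf)
    df = Counter(term for tf in docs_tf for term in tf)
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}  # one common IDF variant
    # Terms never seen during fitting get weight 0.
    return Transformer(lambda tf: {t: c * idf.get(t, 0.0) for t, c in tf.items()})


def fit_linear_svm(features, labels, epochs=20, lr=0.1, reg=0.01):
    """Estimator: linear SVM (hinge loss, L2 penalty) via subgradient steps."""
    w = {}

    def score(x):
        return sum(w.get(t, 0.0) * v for t, v in x.items())

    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * score(x)
            for t in w:  # shrink weights (regularization step)
                w[t] *= 1.0 - lr * reg
            if margin < 1:  # hinge loss is active: move toward the example
                for t, v in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * v
    return Transformer(lambda x: 1 if score(x) >= 0 else -1)


# Toy training data (invented for illustration); labels are +1 / -1.
docs = ["great fun movie", "great acting", "boring dull movie", "dull plot"]
labels = [1, 1, -1, -1]

# Featurization stages need no fitting; IDF and the SVM are fit on the data.
featurize = Transformer(tokenize).and_then(ngrams(2)).and_then(Transformer(term_frequency))
idf = fit_idf([featurize(d) for d in docs])
svm = fit_linear_svm([featurize.and_then(idf)(d) for d in docs], labels)

# The end-to-end pipeline maps raw text directly to a predicted label.
pipeline = featurize.and_then(idf).and_then(svm)
print(pipeline("great fun"))
```

The design point is that every stage, whether a fixed function or the result of fitting an estimator, exposes the same call interface, so stages can be swapped or reordered without touching the rest of the workflow. In a distributed setting each stage would operate on partitioned datasets rather than single Python objects.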