Vu Pham is a machine learning software engineer at Arimo, with a focus on deep learning. He helps build Arimo’s deep learning solutions and is an avid contributor to open-source projects such as cubgs, Deepnet, and deeplearning4j. Prior to Arimo, he worked in both academia and industry, and has authored and co-authored several scientific papers.
Automatic feature generation has been a long-standing research problem in machine learning. The idea is to program machines to automatically extract the most relevant features, which are then used to train predictive models with minimal input from data scientists. This approach could significantly improve data scientists' productivity by applying feature engineering best practices and by relieving them of the burden of tuning the hyper-parameters of predictive models. Inspired by recent work on this topic, we present our system built on top of Spark, in which datasets are processed and new features are generated, transformed, and evaluated with well-established statistical techniques, then fed into predictive machine learning models. The whole process is tuned using Bayesian optimization, a generic framework for tuning configuration parameters. We show that the system works at large scale and achieves strong performance on various datasets. At a higher level of abstraction, we also present our point of view on the general trend of automating parts of the data scientist's job, helping data scientists be more productive and creative.
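As a rough illustration of the generate-then-evaluate idea described above, here is a minimal Python sketch. It is not Arimo's implementation: it runs on plain Python lists rather than Spark, uses absolute Pearson correlation as a stand-in for the statistical evaluation step, and leaves out model training and Bayesian optimization entirely. The function names (`pearson`, `generate_features`, `select_top_k`) and the toy data are hypothetical, chosen only for this example.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def generate_features(columns):
    """Build candidate features: the raw columns, their squares, and pairwise products.

    `columns` maps a column name to a list of numeric values.
    """
    candidates = dict(columns)
    for name, values in columns.items():
        candidates[f"{name}^2"] = [v * v for v in values]
    for (a, xa), (b, xb) in combinations(columns.items(), 2):
        candidates[f"{a}*{b}"] = [p * q for p, q in zip(xa, xb)]
    return candidates


def select_top_k(candidates, target, k):
    """Rank candidates by |correlation with target| and return the k best names."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: abs(pearson(item[1], target)),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy data (assumed): the target is exactly the product of the two raw columns.
columns = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.0, 4.0, 3.0]}
target = [x * y for x, y in zip(columns["a"], columns["b"])]

best = select_top_k(generate_features(columns), target, k=1)
print(best)
```

On this toy data the interaction feature `a*b` ranks first, since it reproduces the target exactly. In a system like the one described, the choice of transformations and the number of features kept would be among the configuration parameters handed to the tuning step.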