Vladimir Feinberg is the Head of Machine Learning at Sisu Data, where he leads the investigation and development of algorithms for large-scale streaming structured data. Prior to Sisu, Vlad was a graduate student in the UC Berkeley Ph.D. program, advised by professors Michael I. Jordan, Ion Stoica, and Joseph E. Gonzalez, focusing on systems and machine learning. Vlad graduated from Princeton University, where he worked with Barbara Engelhardt on Gaussian process estimation procedures and with Kai Li on 3D convolutional neural net optimization.
June 25, 2020 05:00 PM PT
Modern enterprise data, which tracks key performance indicators like conversions or click-throughs, exhibits pathologically high dimensionality, which requires re-thinking data representation to make analysis tractable. For instance, the Malicious URL dataset has over 3.2 million categorical feature columns with a high degree of sparsity. Typical approaches to supervised machine learning, even a simple model such as logistic regression with interaction terms, don't function at all without adaptation, as the parameters alone would require over 80TB of storage. Specialized representations, like the compressed sparse row format, can be integrated with sparsity-aware procedures, such as those found in XGBoost. However, a sparse representation still incurs significant runtime costs and restricts the choice of model to a subset of approaches, such as decision trees or field-aware factorization machines. We demonstrate a chromatic approach to sparse learning, which uses approximate graph coloring to significantly collapse dataset width. By identifying the structure of mutual exclusivity between sparse columns, we can collapse the categorical features of the Malicious URL dataset to 499 dense columns, opening it up to a much broader set of machine learning algorithms. Even for sparse-capable methods, such as XGBoost, using an equivalent dense representation alone yields a 2x training speedup without any performance loss.
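To make the collapsing step concrete, here is a minimal sketch of one way mutual exclusivity could be exploited. It is not the implementation presented in the talk. It assumes binary indicator columns, represents each row as a list of active column indices, and uses a plain largest-degree-first greedy coloring as a stand-in for the approximate coloring the abstract mentions. The function names `build_conflicts`, `greedy_coloring`, and `collapse` are hypothetical.

```python
from collections import defaultdict


def build_conflicts(rows, n_cols):
    """Map each column to the set of columns it co-occurs with in some row."""
    conflicts = [set() for _ in range(n_cols)]
    for active in rows:
        for i in active:
            for j in active:
                if i != j:
                    conflicts[i].add(j)
    return conflicts


def greedy_coloring(conflicts):
    """Give each column the smallest color not used by a conflicting column."""
    # Assumption: visit high-degree columns first. This is a common heuristic
    # and does not guarantee the minimum number of colors.
    order = sorted(range(len(conflicts)), key=lambda c: -len(conflicts[c]))
    color = {}
    for col in order:
        taken = {color[n] for n in conflicts[col] if n in color}
        c = 0
        while c in taken:
            c += 1
        color[col] = c
    return color


def collapse(rows, color):
    """Rewrite each sparse row as a dense row with one value per color."""
    n_colors = max(color.values()) + 1 if color else 0
    # Each column gets a 1-based code within its color group; 0 means that
    # no column of that color is active in the row.
    code = {}
    next_code = defaultdict(int)
    for col in sorted(color):
        next_code[color[col]] += 1
        code[col] = next_code[color[col]]
    dense = []
    for active in rows:
        out = [0] * n_colors
        for col in active:
            out[color[col]] = code[col]
        dense.append(out)
    return dense


# Toy data: columns 0-2 and columns 3-4 each behave like a one-hot group.
rows = [[0, 3], [1, 4], [2, 3], [0, 4]]
conflicts = build_conflicts(rows, n_cols=5)
color = greedy_coloring(conflicts)
dense = collapse(rows, color)
```

Columns that share a color never appear together in a row, so each color can be stored as a single dense categorical column without losing information. In this toy example the five sparse columns collapse to two dense ones.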