by Joseph Bradley and Burak Yavuz
Apache Spark 1.2 introduced Machine Learning (ML) Pipelines to facilitate the creation, tuning, and inspection of practical ML workflows. Spark’s latest release, Spark 1.4, significantly extends the ML library. In this post, we highlight several new features in the ML Pipelines API, including:

- a much larger set of feature transformers,
- new ML algorithms, including trees and ensembles, elastic-net regularization, and one-vs-rest classification,
- a near-complete Python API, and
- customizable pipelines with user-defined stages.
If you’re new to using ML Pipelines, you can get familiar with the key concepts like Transformers and Estimators by reading our previous blog post.
With significant contributions from the Spark community, ML Pipelines are much more feature-rich in the 1.4 release. The API now includes many common feature transformers and more algorithms.
A big part of any ML workflow is massaging the data into the right features for use in downstream processing. To simplify feature extraction, Spark provides many feature transformers out of the box. The table below outlines most of the feature transformers available in Spark 1.4, with a description of each. Much of the API is inspired by scikit-learn; for reference, we provide names of similar scikit-learn transformers where available.
Transformer | Description | scikit-learn analog |
--- | --- | --- |
Binarizer | Threshold numerical feature to binary | Binarizer |
Bucketizer | Bucket numerical features into ranges | |
ElementwiseProduct | Scale each feature/column separately | |
HashingTF | Hash text/data to vector; scale by term frequency | FeatureHasher |
IDF | Scale features by inverse document frequency | TfidfTransformer |
Normalizer | Scale each row to unit norm | Normalizer |
OneHotEncoder | Encode k-category feature as binary features | OneHotEncoder |
PolynomialExpansion | Create higher-order features | PolynomialFeatures |
RegexTokenizer | Tokenize text using regular expressions | (part of text methods) |
StandardScaler | Scale features to 0 mean and/or unit variance | StandardScaler |
StringIndexer | Convert String feature to 0-based indices | LabelEncoder |
Tokenizer | Tokenize text on whitespace | (part of text methods) |
VectorAssembler | Concatenate feature vectors | FeatureUnion |
VectorIndexer | Identify categorical features, and index | |
Word2Vec | Learn vector representation of words | |
Only three of the above transformers were available in Spark 1.3 (HashingTF, StandardScaler, and Tokenizer); the rest are new in Spark 1.4.
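As a quick illustration of the transformer API, here is a minimal sketch using a single transformer; the DataFrame `scoresDF` and its numeric `score` column are hypothetical names:

```scala
import org.apache.spark.ml.feature.Binarizer

// Turn a numeric column into 0/1 values using a threshold.
// `scoresDF` and its "score" column are illustrative names.
val binarizer = new Binarizer()
  .setInputCol("score")
  .setOutputCol("binaryScore")
  .setThreshold(0.5)  // values above 0.5 become 1.0, others 0.0

val binarized = binarizer.transform(scoresDF)
```

Each transformer follows this same pattern: configure input/output columns and any parameters, then call transform to append the new column to the DataFrame.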
The following code snippet demonstrates how multiple feature encoders can be strung together into a complex workflow. This example begins with two types of features: text (String) and userGroup (categorical).
We generate text features using both hashing and the Word2Vec algorithm, and then apply a one-hot encoding to userGroup. Finally, we combine all features into a single feature vector which can be used by ML algorithms such as Logistic Regression.
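Here is a minimal sketch of that workflow, assuming a training DataFrame `trainingDF` with columns `text` (String), `userGroup` (String), and `label` (Double); all column names are illustrative:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature._

// Split text into words, then derive two kinds of text features:
// hashed term frequencies and Word2Vec embeddings.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("tf")
val word2Vec = new Word2Vec().setInputCol("words").setOutputCol("w2v")

// One-hot encode userGroup: index the String values, then encode the indices.
val indexer = new StringIndexer().setInputCol("userGroup").setOutputCol("groupIndex")
val encoder = new OneHotEncoder().setInputCol("groupIndex").setOutputCol("groupVec")

// Combine all features into a single vector column for the classifier.
val assembler = new VectorAssembler()
  .setInputCols(Array("tf", "w2v", "groupVec"))
  .setOutputCol("features")

val lr = new LogisticRegression()  // uses "features" and "label" by default

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, word2Vec, indexer, encoder, assembler, lr))
val model = pipeline.fit(trainingDF)
```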
The following diagram shows the full pipeline. Pipeline stages are shown as blue boxes, and DataFrame columns are shown as bubbles.
In Spark 1.4, the Pipelines API now includes trees and ensembles: Decision Trees, Random Forests, and Gradient-Boosted Trees. These are some of the most important algorithms in machine learning. They can be used for both regression and classification, are flexible enough to handle many types of applications, and can use both continuous and categorical features.
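For instance, a random forest can be dropped into a pipeline like any other stage. A hedged sketch, assuming a DataFrame `trainingDF` with "features" and "label" columns (tree classifiers expect the label to carry category metadata, which StringIndexer provides):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer

// Index the label so it carries the category metadata tree algorithms expect.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(100)

val model = new Pipeline().setStages(Array(labelIndexer, rf)).fit(trainingDF)
```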
The Pipelines API also includes Logistic Regression and Linear Regression using Elastic Net regularization, an important statistical tool mixing L1 and L2 regularization.
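As a sketch of the elastic-net API (`trainingDF` is an assumed DataFrame with "features" and "label" columns):

```scala
import org.apache.spark.ml.regression.LinearRegression

// Elastic net mixes the two penalties: elasticNetParam = 0.0 is pure L2
// (ridge), 1.0 is pure L1 (lasso), and values in between blend them.
val linReg = new LinearRegression()
  .setRegParam(0.01)        // overall regularization strength
  .setElasticNetParam(0.5)  // L1/L2 mixing parameter

val linRegModel = linReg.fit(trainingDF)
```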
Spark 1.4 also introduces OneVsRest (a.k.a. One-Vs-All), which converts any binary classification "base" algorithm into a multiclass algorithm. This flexibility to use any base algorithm in OneVsRest highlights the versatility of the Pipelines API. By using DataFrames, which support varied data types, OneVsRest can remain oblivious to the specifics of the base algorithm.
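A minimal sketch, assuming a DataFrame `multiclassDF` whose String label column takes more than two values:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.feature.StringIndexer

// Index the multiclass label so its metadata records the number of classes.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// OneVsRest trains one copy of the base binary classifier per class.
val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression().setMaxIter(10))
  .setLabelCol("indexedLabel")

val model = new Pipeline().setStages(Array(labelIndexer, ovr)).fit(multiclassDF)
```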
ML Pipelines have a near-complete Python API in Spark 1.4. Python APIs have become much simpler to implement thanks to significant improvements to MLlib’s internal Python APIs and the unified DataFrame API. See the Python API docs for ML Pipelines for a full feature list.
We have opened up APIs for users to write their own Pipeline stages. If you need a custom feature transformer, ML algorithm, or evaluation metric in your workflow, you can write your own and plug it into ML Pipelines. Stages communicate via DataFrames, which act as a simple, flexible API for passing data through a workflow.
The key abstractions are:

- Transformer, which transforms one DataFrame into another (e.g., a feature transformer, or a fitted model that appends predictions);
- Estimator, which can be fit on a DataFrame to produce a Transformer (e.g., a learning algorithm); and
- Evaluator, which computes a metric from a DataFrame of predictions.
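As an example of the developer API, here is a hedged sketch of a custom Transformer, modeled on the built-in Tokenizer; the class name `LowerCaser` is hypothetical:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// A hypothetical custom Transformer that lower-cases a String column.
// UnaryTransformer handles the boilerplate of mapping one column to another.
class LowerCaser(override val uid: String)
  extends UnaryTransformer[String, String, LowerCaser] {

  def this() = this(Identifiable.randomUID("lowerCaser"))

  // The function applied to each value of the input column.
  override protected def createTransformFunc: String => String = _.toLowerCase

  // Only accept String input columns.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType.")

  // The type of the generated output column.
  override protected def outputDataType: DataType = StringType
}
```

Once defined, `new LowerCaser().setInputCol("text").setOutputCol("textLower")` can be placed in a Pipeline alongside built-in stages.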
To learn more, start with the overview of ML Pipelines in the ML Pipelines Programming Guide.
The roadmap for Spark 1.5 includes further extensions to the Pipelines API; the planned items are tracked on the MLlib 1.5 Roadmap JIRA mentioned below.
ML Pipelines do not yet cover all algorithms in MLlib, but the two APIs can interoperate. If your workflow requires components from both APIs, all you need to do is convert between RDDs and DataFrames. For more information on conversions, see the DataFrame guide.
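For instance, here is a sketch of both directions, assuming a SQLContext named `sqlContext` (as in the Spark shell) and a DataFrame `df` with "label" and "features" columns:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

import sqlContext.implicits._  // enables rdd.toDF()

// DataFrame -> RDD[LabeledPoint], for the RDD-based MLlib API.
val labeledPoints = df.select("label", "features").map { row =>
  LabeledPoint(row.getDouble(0), row.getAs[Vector](1))
}

// RDD[LabeledPoint] -> DataFrame, for ML Pipelines.
val df2 = labeledPoints.toDF()
```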
Thanks very much to the community contributors during this release! You can find a complete list of JIRAs for ML Pipelines with contributors on the Apache Spark JIRA.
To get started, download Spark 1.4 and check out the ML Pipelines User Guide! Also try out the ML package code examples. Experts can get started writing their own Transformers and Estimators by looking at the DeveloperApiExample code snippet.
To contribute, follow the MLlib 1.5 Roadmap JIRA. Good luck!