Apache Spark 1.2 introduced Machine Learning (ML) Pipelines to facilitate the creation, tuning, and inspection of practical ML workflows. Spark’s latest release, Spark 1.4, significantly extends the ML library.  In this post, we highlight several new features in the ML Pipelines API, including:

  • A stable API — Pipelines have graduated from Alpha!
  • New feature transformers
  • Additional ML algorithms
  • A more complete Python API
  • A pluggable API for customized, third-party Pipeline components

If you’re new to using ML Pipelines, you can get familiar with the key concepts like Transformers and Estimators by reading our previous blog post.

New Features in Spark 1.4

With significant contributions from the Spark community, ML Pipelines are much more featureful in the 1.4 release.  The API includes many common feature transformers and more algorithms.

New Feature Transformers

A big part of any ML workflow is massaging the data into the right features for use in downstream processing.  To simplify feature extraction, Spark provides many feature transformers out-of-the-box.  The table below lists most of the feature transformers available in Spark 1.4, with a description of each one. Much of the API is inspired by scikit-learn; for reference, we give the name of the analogous scikit-learn transformer where one exists.

| Transformer | Description | scikit-learn analog |
| --- | --- | --- |
| Binarizer | Threshold a numerical feature to binary | Binarizer |
| Bucketizer | Bucket numerical features into ranges | |
| ElementwiseProduct | Scale each feature/column separately | |
| HashingTF | Hash text/data to a vector, scaled by term frequency | FeatureHasher |
| IDF | Scale features by inverse document frequency | TfidfTransformer |
| Normalizer | Scale each row to unit norm | Normalizer |
| OneHotEncoder | Encode a k-category feature as binary features | OneHotEncoder |
| PolynomialExpansion | Create higher-order features | PolynomialFeatures |
| RegexTokenizer | Tokenize text using regular expressions | (part of text methods) |
| StandardScaler | Scale features to zero mean and/or unit variance | StandardScaler |
| StringIndexer | Convert a String feature to 0-based indices | LabelEncoder |
| Tokenizer | Tokenize text on whitespace | (part of text methods) |
| VectorAssembler | Concatenate feature vectors | FeatureUnion |
| VectorIndexer | Identify categorical features, and index them | |
| Word2Vec | Learn vector representations of words | |

Only three of the above transformers were available in Spark 1.3 (HashingTF, StandardScaler, and Tokenizer); the rest are new in 1.4.

The following code snippet demonstrates how multiple feature encoders can be strung together into a complex workflow. This example begins with two types of features: text (String) and userGroup (categorical).  For example:

[Table: example rows with a text (String) column and a userGroup (categorical) column]

We generate text features using both hashing and the Word2Vec algorithm, and then apply a one-hot encoding to userGroup.  Finally, we combine all features into a single feature vector which can be used by ML algorithms such as Logistic Regression.

from pyspark.ml.feature import *
from pyspark.ml import Pipeline

# Split raw text into words, then derive two kinds of text features from them.
tok = Tokenizer(inputCol="text", outputCol="words")
htf = HashingTF(inputCol="words", outputCol="tf", numFeatures=200)
w2v = Word2Vec(inputCol="words", outputCol="w2v")  # Word2Vec expects tokenized words, not raw text

# One-hot encode the categorical userGroup column.
# (OneHotEncoder expects numeric category indices; apply StringIndexer first if userGroup is a String.)
ohe = OneHotEncoder(inputCol="userGroup", outputCol="ug")

# Concatenate all of the feature columns into a single vector.
va = VectorAssembler(inputCols=["tf", "w2v", "ug"], outputCol="features")

pipeline = Pipeline(stages=[tok, htf, w2v, ohe, va])
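
To put the pipeline to work, fit it once and then transform data with the resulting model. A minimal sketch, assuming a hypothetical DataFrame df with text and userGroup columns (where userGroup already holds categorical indices):

# Fit all stages in order; estimator stages such as Word2Vec learn from the data.
model = pipeline.fit(df)

# Append the engineered columns, ending with the combined "features" vector.
featurized = model.transform(df)
featurized.select("features").show()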

The following diagram shows the full pipeline. Pipeline stages are shown as blue boxes, and DataFrame columns are shown as bubbles.

[Diagram: the complete feature pipeline, from text and userGroup to the assembled features column]

Better Algorithm Coverage

In Spark 1.4, the Pipelines API now includes trees and ensembles: Decision Trees, Random Forests, and Gradient-Boosted Trees.  These are some of the most important algorithms in machine learning.  They can be used for both regression and classification, are flexible enough to handle many types of applications, and can use both continuous and categorical features.
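
These algorithms plug into the same fit/transform pattern as any other Pipeline stage. A minimal sketch using a random forest, where the column names, numTrees value, and the trainingData/testData DataFrames are all illustrative:

from pyspark.ml.classification import RandomForestClassifier

# Trees handle both continuous and categorical features; VectorIndexer
# (see the table above) can flag the categorical ones inside a feature vector.
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=50)
rfModel = rf.fit(trainingData)             # trainingData: hypothetical labeled DataFrame
predictions = rfModel.transform(testData)  # appends a "prediction" column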

The Pipelines API also includes Logistic Regression and Linear Regression using Elastic Net regularization, an important statistical tool mixing L1 and L2 regularization.
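
In the Pipelines API, the mix between the two penalties is controlled by a single parameter. A minimal sketch (parameter values are illustrative):

from pyspark.ml.classification import LogisticRegression

# elasticNetParam mixes the penalties: 0.0 is pure L2 (ridge), 1.0 is pure L1 (lasso),
# and values in between blend the two.
lr = LogisticRegression(maxIter=100, regParam=0.01, elasticNetParam=0.5)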

Spark 1.4 also introduces OneVsRest (a.k.a. One-Vs-All), which converts any binary classification "base" algorithm into a multiclass algorithm.  This flexibility to use any base algorithm in OneVsRest highlights the versatility of the Pipelines API.  By using DataFrames, which support varied data types, OneVsRest can remain oblivious to the specifics of the base algorithm.
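
For example, wrapping logistic regression yields a multiclass classifier. A minimal sketch (note that in Spark 1.4 OneVsRest is part of the Scala API; the equivalent Python wrapper shown here arrived in a later release, and multiclassData is a hypothetical DataFrame with a multiclass label):

from pyspark.ml.classification import LogisticRegression, OneVsRest

# Fit one binary logistic regression model per class; at prediction time,
# the most confident of the per-class models wins.
ovr = OneVsRest(classifier=LogisticRegression(maxIter=50))
ovrModel = ovr.fit(multiclassData)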

More Complete Python API

ML Pipelines have a near-complete Python API in Spark 1.4.  Python APIs have become much simpler to implement, thanks to significant improvements in the internal Python APIs and the unified DataFrame API.  See the Python API docs for ML Pipelines for a full feature list.

Customizing Pipelines

We have opened up APIs for users to write their own Pipeline stages.  If you need a custom feature transformer, ML algorithm, or evaluation metric in your workflow, you can write your own and plug it into ML Pipelines.  Stages communicate via DataFrames, which act as a simple, flexible API for passing data through a workflow.

The key abstractions are:

  • Transformer: This includes feature transformers (e.g., OneHotEncoder) and trained ML models (e.g., LogisticRegressionModel).
  • Estimator: This includes ML algorithms for training models (e.g., LogisticRegression).
  • Evaluator: These evaluate predictions and compute metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).

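As a concrete illustration of the first abstraction, here is a minimal sketch of a custom Transformer. It targets the _transform hook in recent PySpark releases; in Spark 1.4 itself, custom stages were written primarily against the Scala developer API (see the DeveloperApiExample mentioned below). The class name and columns are hypothetical:

from pyspark.ml import Transformer
from pyspark.sql.functions import col

class SquareTransformer(Transformer):
    """Hypothetical custom stage: appends the square of a numeric input column."""

    def __init__(self, inputCol, outputCol):
        super(SquareTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, df):
        # Stages communicate via DataFrames: read one column, append another.
        return df.withColumn(self.outputCol, col(self.inputCol) * col(self.inputCol))
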
To learn more, start with the overview of ML Pipelines in the ML Pipelines Programming Guide.

Looking Ahead

The roadmap for Spark 1.5 includes:

  • API: More complete algorithmic coverage in Pipelines, and a more featureful Python API.  There is also initial work towards an MLlib API in SparkR.
  • Algorithms: More feature transformers (such as CountVectorizer, DiscreteCosineTransform, MinMaxScaler, and NGram) and algorithms (such as KMeans clustering and Naive Bayes).
  • Developers: Improvements for developers, including refinements to the feature attributes API and the core abstractions.

ML Pipelines do not yet cover all algorithms in MLlib, but the two APIs can interoperate.  If your workflow requires components from both APIs, all you need to do is convert between RDDs and DataFrames.  For more information on conversions, see the DataFrame guide.
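
A minimal sketch of the round trip, assuming a hypothetical DataFrame df with "label" and "features" columns (sqlContext comes from the surrounding application):

from pyspark.mllib.regression import LabeledPoint

# DataFrame -> RDD, for the RDD-based MLlib API
labeledRDD = df.select("label", "features").rdd \
    .map(lambda row: LabeledPoint(row.label, row.features))

# RDD -> DataFrame, to feed data back into an ML Pipeline
df2 = sqlContext.createDataFrame(
    labeledRDD.map(lambda lp: (float(lp.label), lp.features)),
    ["label", "features"])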

Acknowledgements

Thanks very much to the community contributors during this release!  You can find a complete list of JIRAs for ML Pipelines with contributors on the Apache Spark JIRA.

Learning More

To get started, download Spark 1.4 and check out the ML Pipelines User Guide!  Also try out the ML package code examples. Experts can get started writing their own Transformers and Estimators by looking at the DeveloperApiExample code snippet.

To contribute, follow the MLlib 1.5 Roadmap JIRA.  Good luck!