
This is a guest blog from our friend Vincenzo Selvaggio, who contributed this feature. He is a Senior Java Technical Architect and Project Manager focusing on delivering advanced business process solutions for investment banks.


The recently released Apache Spark 1.4 introduces PMML support to MLlib for linear models and k-means clustering. This achievement is the result of active community discussions on JIRA (https://issues.apache.org/jira/browse/SPARK-1406) and GitHub (https://github.com/apache/spark/pull/3062), and it is a step toward interoperability between Apache Spark and other predictive analytics platforms.

What is PMML?

Predictive Model Markup Language (PMML) is the leading data mining standard, developed by the Data Mining Group (DMG), an independent consortium, and adopted by major vendors and organizations (http://dmg.org/pmml/products.html). PMML represents data mining models as XML documents, each with the following components:

  • a Header giving general information such as a description of the model and the application used to generate it
  • a DataDictionary containing the definition of fields used by the model
  • a Model defining the structure and the parameters of the data mining model
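For illustration, here is a heavily simplified sketch of such a document. The element names come from the PMML specification; the attribute values and field names are purely illustrative and not what any particular exporter produces:

<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <!-- Header: general information about the model and the producing application -->
  <Header description="k-means clustering">
    <Application name="Apache Spark MLlib"/>
  </Header>
  <!-- DataDictionary: definition of the fields used by the model -->
  <DataDictionary numberOfFields="3">
    <DataField name="field_0" optype="continuous" dataType="double"/>
    <DataField name="field_1" optype="continuous" dataType="double"/>
    <DataField name="field_2" optype="continuous" dataType="double"/>
  </DataDictionary>
  <!-- Model: here a ClusteringModel holding the structure and parameters -->
  <ClusteringModel modelName="k-means" functionName="clustering" numberOfClusters="3">
    <!-- MiningSchema, ComparisonMeasure, ClusteringField and Cluster elements go here -->
  </ClusteringModel>
</PMML>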

Why use PMML?

PMML allows users to build a model in one system, export it and deploy it in a different environment for prediction. In other words, it enables different platforms to speak the same language, removing the need for custom storage formats.

[Figure: PMML. Image courtesy of Villu Ruusmann]

What's more, adopting a standard encourages best practices (established ways of structuring models) and transparency (PMML documents are fully intelligible and not black boxes).

Why does Spark support PMML?

Building a model (producer) and scoring it (consumer) are largely decoupled tasks, since they require different systems and supporting infrastructure.

Model building is complex: it is performed on large amounts of historical data and requires a fast, scalable engine to produce correct results. This is where Apache Spark's MLlib shines.

Model scoring is performed by operational applications that are tuned for high throughput and detached from the analytical platform. Exporting MLlib models to PMML enables sharing models between Spark and operational applications, and is key to the success of predictive analytics.

A Code Example

In Spark, exporting a data mining model to PMML is as simple as calling model.toPMML. Here is a complete Scala example that builds a KMeansModel and exports it to a local file:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("/path/to/file")
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))

// Cluster the data into three classes using KMeans
val numIterations = 20
val numClusters = 3
val kmeansModel = KMeans.train(data, numClusters, numIterations)

// Export clustering model to PMML
kmeansModel.toPMML("/path/to/kmeans.xml")

The generated PMML document is written to the file kmeans.xml.
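Besides writing to a local file, the toPMML method (from MLlib's PMMLExportable trait) has overloads for other export targets; as shown in the official documentation, a model can also be exported as a String, to an OutputStream, or to a path on a distributed file system:

// Export the model to a String in PMML format
println(kmeansModel.toPMML)

// Export the model to a directory on a distributed file system in PMML format
kmeansModel.toPMML(sc, "/path/to/kmeans")

// Export the model to an OutputStream in PMML format
kmeansModel.toPMML(System.out)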

For more examples of exported models, and of how they can be scored outside of Spark using the JPMML library, see https://spark-packages.org/package/selvinsource/spark-pmml-exporter-validator.
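To give a feel for the consumer side, here is a minimal Scala sketch of scoring the exported model with the JPMML-Evaluator library. It is an illustration only and rests on a few assumptions: a recent JPMML-Evaluator version, in which input fields are addressed by plain strings, and illustrative input field names (field_0, field_1, field_2) that must match whatever names the exporter wrote into the DataDictionary:

import java.io.File
import scala.collection.JavaConverters._

import org.jpmml.evaluator.{EvaluatorUtil, LoadingModelEvaluatorBuilder}

// Build an evaluator from the PMML document exported by Spark
val evaluator = new LoadingModelEvaluatorBuilder()
  .load(new File("/path/to/kmeans.xml"))
  .build()

// One incoming record; the keys must match the field names in the DataDictionary
// (field_0, field_1, field_2 are assumed names for this sketch)
val record = Map("field_0" -> 1.0, "field_1" -> 2.0, "field_2" -> 3.0)

// Convert the raw values into prepared PMML field values and evaluate
val arguments = evaluator.getInputFields.asScala.map { field =>
  field.getName -> field.prepare(record(field.getName))
}.toMap

val results = EvaluatorUtil.decodeAll(evaluator.evaluate(arguments.asJava))
println(results) // the predicted cluster for the record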

Summary

Apache Spark 1.4 introduces PMML model export, making MLlib interoperable with PMML-compliant systems. You can find the supported models and how to export them to PMML on the official documentation page:

https://spark.apache.org/docs/latest/mllib-pmml-model-export.html. We want to thank everyone who helped review and QA the implementation.

There is still work to do on MLlib's PMML support, for example supporting PMML export for more models and adding a Python API. For more details, please visit https://issues.apache.org/jira/browse/SPARK-8545.

