Introducing Apache Spark 2.2

Published: July 11, 2017

Today we are happy to announce the availability of Apache Spark 2.2.0 on Databricks as part of the Databricks Runtime 3.0.

This release is a major milestone for Structured Streaming: it is now production ready, and the experimental tag has been removed. The release also adds support for arbitrary stateful operations in a stream, and Apache Kafka 0.10 support for both reading and writing using the streaming and batch APIs. In addition to extending new functionality to SparkR, Python, MLlib, and GraphX, the release focuses on usability, stability, and refinement, resolving over 1100 tickets.

This blog post discusses some of the high-level changes, improvements and bug fixes:

Production ready Structured Streaming
Expanding SQL functionalities
New distributed machine learning algorithms in R
Additional Algorithms in MLlib and GraphX

Structured Streaming

Introduced in Spark 2.0, Structured Streaming is a high-level API for building continuous applications. Our goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way.

As the third release in the 2.x line, Spark 2.2 declares Structured Streaming production ready by removing the experimental tag. It also brings additional high-level changes:

Kafka Source and Sink: Support for reading and writing data in streaming or batch to and from Apache Kafka
Kafka Improvements: Cached producer for lower latency Kafka to Kafka streams
Additional Stateful APIs: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
Run Once Triggers: Allows triggering a single one-time execution, which lowers the cost of clusters
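
To make the stateful-processing item concrete, here is a minimal pure-Python sketch of the idea behind per-key state with timeouts. It is not the Spark API: SessionState, process_batch, and the integer timestamps are hypothetical names and simplifications invented for this illustration, and the real [flat]MapGroupsWithState operators run distributed and fault-tolerant inside the streaming engine.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SessionState:
    """Per-key state carried across micro-batches (hypothetical shape)."""
    event_count: int = 0
    last_seen: int = 0  # timestamp of the most recent event for this key


def process_batch(
    state: Dict[str, SessionState],
    events: List[Tuple[str, int]],
    now: int,
    timeout: int,
) -> List[Tuple[str, int]]:
    """Fold one batch of (key, timestamp) events into the per-key state,
    then expire keys that have been idle for at least `timeout`.

    Returns (key, event_count) for every session that timed out.
    """
    for key, ts in events:
        session = state.setdefault(key, SessionState())
        session.event_count += 1
        session.last_seen = max(session.last_seen, ts)

    # Keys with no recent activity are emitted and their state is dropped.
    expired = sorted(k for k, s in state.items() if now - s.last_seen >= timeout)
    return [(k, state.pop(k).event_count) for k in expired]


# Two micro-batches: key "a" goes idle and is emitted in the second one.
state: Dict[str, SessionState] = {}
first = process_batch(state, [("a", 1), ("b", 2), ("a", 3)], now=3, timeout=10)
second = process_batch(state, [("b", 12)], now=13, timeout=10)
```

After the second call, the session for "a" has been emitted with its two events and removed from the state, while "b" stays active. The design point this is meant to show is that the state outlives any single batch and that a timeout is what lets the engine eventually discard it.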

At Databricks, we religiously believe in dogfooding. Using a release candidate version of Spark 2.2, we have ported some of our internal data pipelines and have worked with some of our customers to port their production pipelines using Structured Streaming.

SQL and Core APIs

Since the Spark 2.0 release, Spark has been one of the most feature-rich and standards-compliant SQL query engines in the Big Data space. It can connect to a variety of data sources and supports SQL-2003 features such as analytic functions and subqueries. Spark 2.2 adds a number of SQL functionalities:

API Updates: Unified CREATE TABLE syntax for data source and Hive serde tables, and added broadcast hints such as BROADCAST, BROADCASTJOIN, and MAPJOIN for SQL queries
Overall Performance and stability:
- Cost-based optimizer cardinality estimation for filter, join, aggregate, project and limit/sample operators and Cost-based join re-ordering
- TPC-DS performance improvements using star-schema heuristics
- File listing/IO improvements for CSV and JSON
- Partial aggregation support of HiveUDAFFunction
- Introduce a JVM object based aggregate operator
Other notable changes:
- Support for parsing multi-line JSON and CSV files
- Analyze Table Command on partitioned tables
- Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables
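
As a rough illustration of why cardinality estimates matter for join re-ordering, the sketch below scores left-deep join orders by the sum of their estimated intermediate result sizes. It is a toy model, not Spark's optimizer: the table names, row counts, and distinct-value counts are invented, the estimate is the textbook formula |A| * |B| / max(ndv_A, ndv_B), and the exhaustive search over permutations stands in for whatever search strategy and cost model Spark actually uses.

```python
from itertools import permutations

# Hypothetical statistics, invented for this illustration.
ROWS = {"sales": 1_000_000, "item": 500, "store": 100}

# Distinct join-key values on each side of every join edge.
NDV = {
    frozenset({"sales", "item"}): (10_000, 500),
    frozenset({"sales", "store"}): (100, 100),
}


def join_cardinality(joined, card, table):
    """Estimate the row count after joining `table` onto the joined set.

    Returns None when no join edge connects them (a cross product).
    """
    edges = [
        NDV[frozenset({t, table})]
        for t in joined
        if frozenset({t, table}) in NDV
    ]
    if not edges:
        return None
    estimate = card * ROWS[table]
    for left_ndv, right_ndv in edges:
        estimate /= max(left_ndv, right_ndv)
    return estimate


def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep order."""
    joined, card, cost = [order[0]], ROWS[order[0]], 0.0
    for table in order[1:]:
        card = join_cardinality(joined, card, table)
        if card is None:
            return float("inf")  # reject plans that need a cross product
        cost += card
        joined.append(table)
    return cost


def best_order(tables):
    """Pick the cheapest left-deep join order by brute force."""
    return min(permutations(tables), key=plan_cost)


best = best_order(["sales", "item", "store"])
```

With these made-up numbers, joining the selective item table first keeps the intermediate result at an estimated 50,000 rows, whereas joining store first carries all 1,000,000 sales rows into the second join. Both orders produce the same final estimate; only the work along the way differs.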

MLlib, SparkR, and Python

The last major set of changes in Spark 2.2 focuses on advanced analytics and Python. You can now install PySpark from PyPI using pip install pyspark. To boost advanced analytics, a few new algorithms were added to MLlib and GraphX:

Locality Sensitive Hashing
Multiclass Logistic Regression
Personalized PageRank
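
For readers unfamiliar with Personalized PageRank, the following is a small single-machine sketch of the idea using power iteration. It is not the GraphX implementation: the function name, the default reset probability of 0.15, the fixed iteration count, and the choice to send the rank of dangling nodes back to the source are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def personalized_pagerank(
    edges: List[Tuple[str, str]],
    source: str,
    reset_prob: float = 0.15,
    iterations: int = 50,
) -> Dict[str, float]:
    """Rank nodes by a random walk that always restarts at `source`."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # All rank mass starts on the source node.
    rank = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            follow = (1.0 - reset_prob) * rank[n]
            if out_links[n]:
                share = follow / len(out_links[n])
                for dst in out_links[n]:
                    nxt[dst] += share
            else:
                # Assumption: a dangling node returns its mass to the source.
                nxt[source] += follow
        # With probability reset_prob the walk jumps back to the source.
        nxt[source] += reset_prob
        rank = nxt
    return rank


ranks = personalized_pagerank(
    [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")], source="a"
)
```

The only difference from ordinary PageRank in this sketch is where the reset mass goes: it all returns to the chosen source rather than being spread evenly over every node, so the scores measure proximity to that source.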

Spark 2.2 also adds support for the following distributed algorithms in SparkR:

ALS
Isotonic Regression
Multilayer Perceptron Classifier
Random Forest
Gaussian Mixture Model
LDA
Multiclass Logistic Regression
Gradient Boosted Trees
Structured Streaming API for R
Column functions to_json and from_json for R
Multi-column approxQuantile in R

With the addition of these algorithms, SparkR has become the most comprehensive library for distributed machine learning on R.

While this blog post only covered some of the major features in this release, you can read the official release notes to see the complete list of changes.

If you want to try out these new features, you can use Spark 2.2 in Databricks Runtime 3.0. Sign up for a free trial account here.
