One of Apache Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of lambda expressions in Java 8, we’ve updated Spark’s API to transparently support these expressions, while staying compatible with old versions of Java. This new support will be available in Apache Spark 1.0.

A Few Examples

The following examples show how Java 8 makes code more concise. In our first example, we search a log file for lines that contain “error”, using Spark’s filter and count operations. The code is simple to write, but passing a Function object to filter is clunky:

Java 7 search example:
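
Here is a sketch of that Java 7 code, assuming an existing JavaSparkContext named sc and a hypothetical log path (both placeholders, not from the original post):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Keep only the lines that contain "error". The anonymous
// Function class is the boilerplate that clutters the code.
JavaRDD<String> lines = sc.textFile("hdfs://log.txt").filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("error");
    }
  });
long numErrors = lines.count();
```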

(If you’re new to Spark, JavaRDD is a distributed collection of objects, in this case lines of text in a file. We can apply operations to these objects that will automatically be parallelized across a cluster.)

With Java 8, we can replace the Function object with an inline function expression, making the code a lot cleaner:

Java 8 search example:
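
A sketch of the same search with a Java 8 lambda, under the same assumptions (the JavaSparkContext sc and the path are placeholders):

```java
import org.apache.spark.api.java.JavaRDD;

// The lambda implements the same single-method Function
// interface that filter expects, with none of the ceremony.
JavaRDD<String> lines = sc.textFile("hdfs://log.txt")
                          .filter(s -> s.contains("error"));
long numErrors = lines.count();
```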

The gains become even bigger for longer programs. For instance, the program below implements Word Count by taking a file (read as a collection of lines), splitting each line into multiple words, and then counting the words with a reduce function.

Java 7 word count:
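
One way the Java 7 version might look, again assuming a JavaSparkContext sc and placeholder input and output paths:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

// Split each line into words (Spark 1.x flatMap expects an Iterable).
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });

// Pair each word with an initial count of 1.
JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String w) {
      return new Tuple2<String, Integer>(w, 1);
    }
  });

// Sum the counts for each word.
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });

counts.saveAsTextFile("hdfs://counts.txt");
```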

With Java 8, we can write this program in just a few lines:

Java 8 word count:
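
A sketch of the equivalent Java 8 program, under the same assumptions:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
// Split lines into words, pair each word with 1, and sum by key.
JavaRDD<String> words =
    lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts =
    words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
         .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://counts.txt");
```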

We are very excited to offer this functionality, as it brings the simple, concise programming style that Scala and Python Spark users are familiar with to a much broader set of developers.

Availability

Java 8 lambda support will be available in Apache Spark 1.0, which will be released in early May. Although this syntax requires Java 8, Apache Spark 1.0 will still support older versions of Java through the existing form of the API. A lambda expression is interchangeable with an anonymous inner class that implements the same single-method interface, so the same API can be used from any Java version.
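
A minimal sketch of that interchangeability, using Spark's org.apache.spark.api.java.function.Function interface (the example predicate is ours):

```java
import org.apache.spark.api.java.function.Function;

// Both values implement the same single-method interface, so any
// Spark API that accepts a Function works with either form.
Function<String, Boolean> java7Style = new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("error");
  }
};
Function<String, Boolean> java8Style = s -> s.contains("error");
```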

Learn More About Spark

If you’d like to learn more about Spark, the official documentation can help you get started today in Java, Scala, or Python. Spark is easy to run on your laptop, with no installation beyond downloading and unzipping a release.
