
Apache Spark and the Lightbend Reactive Platform: A Match Made in Heaven

When I started working with Hadoop several years ago, I was frustrated by how hard it was to write Hadoop jobs. If your problem fits a query model, Hive provides a SQL-based scripting tool. For many common dataflow problems, Pig provides useful abstractions, but it isn't a full-fledged, "Turing-complete" language. Otherwise, you had to use the low-level Hadoop MapReduce API. Third-party APIs such as Cascading and Scalding wrap the MapReduce API, but they can't fix MapReduce's performance problems.

Spark - The New Big Data Compute Engine

But interest in an alternative, Apache Spark, was growing. Now, Spark has emerged as the next-generation platform for writing Big Data applications for Hadoop and Mesos clusters.

Spark is replacing the venerable Hadoop MapReduce for several reasons:

Performance

Spark's Resilient Distributed Datasets (RDDs) are fault-tolerant, distributed collections of data that can be manipulated in parallel. RDDs exploit intelligent, in-memory caching to avoid unnecessary round trips to disk (writes followed by reads), which are common in non-trivial MapReduce jobs where map and reduce steps are sequenced together.
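Here is a minimal sketch of what that caching looks like with the Scala API. The input path and parsing logic are placeholders; the point is that cache() lets several actions reuse the same in-memory data instead of re-reading it from disk each time.

import org.apache.spark.{SparkConf, SparkContext}

object CachingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-example"))

    // Hypothetical input path; substitute your own data set.
    val events = sc.textFile("hdfs://namenode/events.log")

    // Mark the filtered RDD for in-memory caching. The first action below
    // materializes it; later actions reuse the cached partitions instead of
    // re-reading and re-filtering the file.
    val errors = events.filter(_.contains("ERROR")).cache()

    val total  = errors.count()
    val byHost = errors.map(line => (line.split("\\s+")(0), 1)).reduceByKey(_ + _)
    println(s"total errors: $total, distinct hosts: ${byHost.count()}")

    sc.stop()
  }
}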

Natural Data Processing Idioms

Spark provides a powerful set of composable building blocks for writing concise, expressive queries and dataflows. While the MapReduce API can be used to write a wide range of computations, translating many algorithms to the API can be very difficult, requiring special expertise. In contrast, the concise Scala, Java, and Python APIs provided by Spark make developers highly productive.
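To give a feel for that conciseness, here is the classic word count written as a chain of RDD operations (the paths are placeholders). The equivalent MapReduce program needs separate mapper, reducer, and driver classes.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))

    // The whole dataflow is a pipeline of composable transformations,
    // ending in an action that writes the results.
    sc.textFile("hdfs://namenode/input")
      .flatMap(_.split("""\W+"""))
      .filter(_.nonEmpty)
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs://namenode/output")

    sc.stop()
  }
}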

Streaming vs. Batch-mode Operations

MapReduce only supports batch-mode operations. Increasingly, data teams need more "real-time" processing of event streams. Rather than turning to yet another tool for this purpose, Spark lets you write streaming and batch-mode applications with very similar logic and APIs.
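For example, here is a hypothetical streaming version of the word count above, using Spark Streaming's DStream API. The transformations are essentially the same as in batch mode; only the input source and the micro-batch context change. The host and port are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-word-count")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // A DStream of lines arriving over a socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same transformations as the batch word count, applied to each
    // 5-second micro-batch of the stream.
    lines.flatMap(_.split("""\W+"""))
         .filter(_.nonEmpty)
         .map(word => (word.toLowerCase, 1))
         .reduceByKey(_ + _)
         .print()

    ssc.start()
    ssc.awaitTermination()
  }
}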

What Makes Spark so Successful?

Part of Spark's success is due to the foundation it is built upon, components of the Lightbend Reactive Platform. First, there's Scala, the flexible, object-functional language for the JVM. People often ask Matei Zaharia, the creator of Spark and the co-founder of Databricks, why he chose Scala. Here is a recent answer he gave to the question:

Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated big data framework I know of, that begat things like FlumeJava and Crunch). On the JVM, the only language that would offer that kind of API was Scala, due to its ability to capture functions and ship them across the network. Scala’s static typing also made it much easier to control performance compared to, say, Jython or Groovy.

The second Lightbend component in Spark's foundation is Akka, a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.

Spark exploits Akka's distributed, fine-grained, flexible, and dynamic Actor model to build resilient, distributed components for managing and processing data.
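To illustrate the model itself (this is not Spark's internal code, just a minimal, hypothetical Akka sketch): an actor encapsulates its own state and reacts to messages one at a time, which is what makes it a natural building block for resilient, distributed components.

import akka.actor.{Actor, ActorSystem, Props}

// Messages the actor understands (names are made up for illustration).
case class Register(blockId: String)
case object Count

// A toy actor: it owns private state and processes one message at a time
// from its mailbox, so no locks are needed.
class BlockTracker extends Actor {
  private var blocks = Set.empty[String]

  def receive = {
    case Register(id) => blocks += id
    case Count        => sender() ! blocks.size
  }
}

object ActorModelSketch extends App {
  val system  = ActorSystem("sketch")
  val tracker = system.actorOf(Props(new BlockTracker), "block-tracker")

  // Fire-and-forget sends; delivery and processing are asynchronous.
  tracker ! Register("block-1")
  tracker ! Register("block-2")

  system.terminate()
}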

Lightbend and Databricks, Working Together

The combination of Apache Spark and the Lightbend Reactive Platform, including Scala, Akka, Play, and Slick, gives enterprise developers a comprehensive suite of tools for building highly scalable, resilient, Certified on Spark applications with minimal effort.

Lightbend will continue to build tools that help make Spark great and Databricks successful. We'll also work to make the developer experience seamless between our tools.

For starters, I encourage you to check out our growing Lightbend Activator templates for Spark, especially my introductory Spark Workshop, which is our first Certified on Spark application.
