Apache Spark and the Typesafe Reactive Platform: A Match Made in Heaven
When I started working with Hadoop several years ago, I was frustrated by how hard it was to write Hadoop jobs. If your problem fit a query model, Hive provided a SQL-based scripting tool. For many common dataflow problems, Pig provided useful abstractions, but it wasn’t a full-fledged, “Turing-complete” language. Otherwise, you had to use the low-level Hadoop MapReduce API. Some third-party APIs, such as Cascading and Scalding, wrapped the MapReduce API, but they couldn’t fix MapReduce’s performance problems.
Spark – The New Big Data Compute Engine
But interest in an alternative, Apache Spark, was growing. Now, Spark has emerged as the next-generation platform for writing Big Data applications for Hadoop and Mesos clusters.
Spark is replacing the venerable Hadoop MapReduce for several reasons:
Performance
Spark’s Resilient Distributed Datasets (RDDs) are fault-tolerant, distributed collections of data that can be manipulated in parallel. RDDs exploit intelligent, in-memory caching of data, which avoids unnecessary round trips to disk (writes followed by reads). Such round trips are common in non-trivial MapReduce jobs, where map and reduce steps are sequenced together.
Natural Data Processing Idioms
Spark provides a set of composable building blocks for writing concise yet powerful queries and dataflows. While the MapReduce API can express a wide range of computations, translating many algorithms to it can be very difficult and requires special expertise. In contrast, the concise Scala, Java, and Python APIs provided by Spark make developers highly productive.
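As an illustration of what "composable building blocks" means, here is the classic word count written as three small steps in plain Python, using only the standard library so it runs anywhere. The helper `reduce_by_key` and the sample `lines` are my own illustrative names and data. In PySpark, the same shape would be a chain of the RDD methods `flatMap`, `map`, and `reduceByKey`; the comment in the code shows that form, assuming an existing SparkContext named `sc` and a hypothetical input path.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

lines = ["to be or not to be", "that is the question"]


def reduce_by_key(pairs, fn):
    """Group values by key, then fold each group with fn (illustrative helper)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}


# Step 1 (flatMap): split every line into words and flatten the result.
words = chain.from_iterable(line.split() for line in lines)
# Step 2 (map): pair each word with a count of 1.
pairs = ((word, 1) for word in words)
# Step 3 (reduceByKey): sum the counts for each distinct word.
counts = reduce_by_key(pairs, lambda a, b: a + b)

# Roughly the same pipeline in PySpark (assumes `sc` and an input path):
#   sc.textFile(path) \
#     .flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a + b)

print(counts["to"], counts["be"], counts["question"])  # 2 2 1
```

Each step is a small function applied to a collection, so steps can be added, removed, or reordered without restructuring the whole job into separate map and reduce phases.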
Streaming vs. Batch-mode Operations
MapReduce only supports batch-mode operations. Increasingly, data teams need more “real-time” processing of event streams. Rather than turning to yet another tool for this purpose, Spark lets you write streaming and batch-mode applications with very similar logic and APIs.
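The idea of sharing logic between batch and streaming jobs can be sketched without Spark at all. In the plain-Python example below, one function, `count_errors`, is applied both to a complete data set and to small batches cut from a stream of records. All names and the sample log are hypothetical, and the fixed-size `micro_batches` helper is only a simplification: Spark Streaming groups incoming records into small batches by time interval and handles distribution and recovery for you.

```python
from typing import Iterable, Iterator, List


def count_errors(records: Iterable[str]) -> int:
    """The shared business logic: count records that mention ERROR."""
    return sum(1 for record in records if "ERROR" in record)


def micro_batches(stream: Iterator[str], size: int) -> Iterator[List[str]]:
    """Cut a stream into consecutive batches of at most `size` records."""
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter batch
        yield batch


log = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]

# Batch mode: run the logic once over everything.
batch_total = count_errors(log)

# Streaming mode: run the same logic over each small batch as it arrives.
stream_counts = [count_errors(b) for b in micro_batches(iter(log), 2)]

print(batch_total, stream_counts)  # 3 [1, 1, 1]
```

Because `count_errors` knows nothing about where its records come from, the same code serves both modes; only the plumbing around it changes.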
What Makes Spark so Successful?
Part of Spark’s success is due to the foundation it is built upon: components of the Typesafe Reactive Platform. First, there’s Scala, the flexible, object-functional language for the JVM. People often ask Matei Zaharia, the creator of Spark and the co-founder of Databricks, why he chose Scala. Here is a recent answer he gave to the question:
Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated big data framework I know of, that begat things like FlumeJava and Crunch). On the JVM, the only language that would offer that kind of API was Scala, due to its ability to capture functions and ship them across the network. Scala’s static typing also made it much easier to control performance compared to, say, Jython or Groovy.
The second Typesafe component in Spark’s foundation is Akka, a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.
Spark exploits Akka’s distributed, fine-grained, flexible, and dynamic Actor model to build resilient, distributed components for managing and processing data.
Typesafe and Databricks, Working Together
The combination of Apache Spark and the Typesafe Reactive Platform, including Scala, Akka, Play, and Slick, gives enterprise developers a comprehensive suite of tools for building highly scalable, resilient Certified on Spark applications with minimal effort.
Typesafe will continue to build tools that help make Spark great and Databricks successful. We’ll also work to make the developer experience seamless between our tools.