Skip to main content

With over 500 paying customers, my team and I have the opportunity to talk to many organizations that are leveraging Hadoop in production to extract value from big data. One of the most common topics raised by our customers in recent months is Apache Spark. Some customers just want to learn more about the advantages of this technology and the use cases that it addresses, while others are already running it in production with the MapR Distribution. These customers range from the world’s largest cable telcos and retailers to Silicon Valley startups such as Quantifind, which recently talked about its use of Spark on MapR in an interview with Stefan Groschupf, CEO of Datameer.

Today, I am happy to announce and share with you the beginning of our journey with Databricks, and the addition of the complete Spark stack to the MapR Distribution for Apache Hadoop. We are now the only Hadoop distribution to support the complete Spark stack, including Spark, Spark Streaming (stream processing), Shark (Hive on Spark), MLLib (machine learning) and GraphX (graph processing). This is a testament to our commitment to open source and to providing our customers with maximum flexibility to pick and choose the right tool for the job.

Why Spark?

One of the challenges organizations face when adopting Hadoop is a shortage of developers who have experience building Hadoop applications. Our professional services organization has helped dozens of companies with the development and deployment of Hadoop applications, and our training department has trained countless engineers. Organizations are hungry for solutions that make it easier to develop Hadoop applications while increasing developer productivity, and Spark fits this bill. Spark jobs can require as little as 1/5th of code. Spark provides a simple programming abstraction allowing developers to design applications as operations on data collections (known as RDDs, or Resilient Distributed Datasets). Developers can build these applications in multiple programming languages, including Java, Scala and Python, and the same code can be reused across batch, interactive and streaming applications.

In addition to making developers happier and more productive, Spark provides significant benefits with respect to end-to-end application performance. To this end, Spark provides a general-purpose execution framework with in-memory pipelining. For many applications, this results in a 5-100x performance improvement, because some or all steps can execute in memory without unnecessarily writing to and reading from disk. The performance advantage of the Spark engine, combined with the industry-leading performance of the MapR Distribution, provides customers with the highest-performance platform for big data applications.

Why Databricks?

Databricks was founded by the team that started the Spark research project at UC Berkeley that later became Apache Spark. When we decided to add the Spark stack to our distribution and double down on our involvement in the Spark community, a strategic partnership with Databricks was a no-brainer. This partnership will benefit MapR customers who are interested in 24x7 support for Spark or any of the other projects in the stack, including Spark Streaming, Shark, MLLib and GraphX (with several other projects coming soon). In addition, MapR will be working closely with Databricks to drive the Spark roadmap and accelerate the development of new features, benefiting both MapR customers and the broader community.

We are very excited about the upcoming Apache Spark 1.0 release, expected later this month. We are looking forward to a great journey with Databricks and the other members of the Spark community.

Register for an upcoming joint webinar to learn more about the benefits of the complete Spark stack on MapR.

Try Databricks for free

Related posts

Apache Spark Key Terms, Explained

June 22, 2016 by Jules Damji and Denny Lee in
This article was originally posted on KDnuggets The Spark Summit Europe call for presentations is open, submit your idea today As observed in...

Databricks is now Generally Available

June 15, 2015 by Ion Stoica and Matei Zaharia in
We are excited to announce today, at Spark Summit 2015 , the general availability of the Databricks – a hosted data platform from...

A Deep Dive into the Latest Performance Improvements of Stateful Pipelines in Apache Spark Structured Streaming

This post is the second part of our two-part series on the latest performance improvements of stateful pipelines. The first part of this...
See all Company Blog posts