Pivotal Hadoop Integrates the Full Apache Spark Stack

Published: May 23, 2014

This post is guest-authored by our friends at Pivotal, describing why they’re excited to deliver Apache Spark on their world-class Pivotal HD big data analytics platform suite.

Today, we are excited to announce the immediate availability of the full Apache Spark stack on Pivotal HD. We have been impressed with the rapid adoption of Spark as a replacement for Hadoop’s more traditional processing engines as well as its vibrant ecosystem, and are thrilled to make it possible for Pivotal customers to run Apache Spark on Pivotal HD Hadoop. Just as important is how we’re doing it: Pivotal HD will be part of Databricks’ upcoming certification program – meaning a commitment to provide compatibility with Apache Spark and support the growing ecosystem of Spark applications.

Pivotal HD and Spark

Unlike a multi-vendor patchwork of heterogeneous solutions, Pivotal brings together an integrated full stack of technologies to allow enterprises to create a Business Data Lake. Pivotal HD 2.0.1 consists of a Hadoop distribution that is compatible with Apache Hadoop 2.x, a market-leading SQL-on-Hadoop query engine in HAWQ, and GemFire XD for in-memory data serving and ultra-low-latency transaction processing. Together these platforms extend Pivotal’s differentiation in both the Hadoop ecosystem and the more established data warehousing markets, meeting the full spectrum of analytics requirements from batch to ultra-low latency.

With Spark, Pivotal aims to further extend this differentiation by leveraging Spark’s cutting-edge capabilities and integrating them with the rest of Pivotal’s world-class platform. Much has been written about Spark’s benefits, but what really drew us to it were the following characteristics:

Speed: Spark can process HDFS data in place up to 100x faster than Hadoop MapReduce using its in-memory-optimized architecture
Unification: Out of the box, Spark provides a wide breadth of functionality – including streaming data support, machine learning, and graph computation – which, when combined with Pivotal HD, gives customers a full end-to-end experience
Ease of use: Spark enables developers to use Java, Scala, or Python across their entire workflow; additionally, it exposes 80+ high-level operators that let programs be written in 2-5x less code than equivalent MapReduce jobs

Though the traditional Hadoop processing components such as MapReduce, Pig, and Hive will remain part of Pivotal HD, we imagine many customers will begin using Spark instead because of these benefits.

Pivotal and Open Source

Open source is a critical part of Pivotal’s DNA. Pivotal has been committed to open source software through active involvement in open-source projects such as Tomcat, RabbitMQ, Redis, Hadoop, Cloud Foundry, Spring, Grails, MADlib, Chorus, and Groovy – Spark will be no different.

One of the main attractions of Spark for us is the growing community and ecosystem. With nearly 200 contributors over the past 12 months, it is one of the most active projects in the Apache and Hadoop open-source ecosystem. Even more exciting is the potential that the ecosystem of applications built on top of Spark holds (something that we’re obviously passionate about at Pivotal); new “powered-by-spark” applications seem to be emerging daily.

However, we’ve seen before how quickly this potential can be diminished by forking and fragmentation in open source projects. That is why we’re excited to join Databricks in their efforts to unify the community. Pivotal’s distribution of Spark provides full compatibility with Apache Spark, enabling the growing set of “Certified on Spark” applications to run on it out of the box. Given the benefits for our customers, and the open and transparent nature of the process, this was an easy decision. This effort is yet another testament to Pivotal’s commitment to open source innovation that brings value to customers.

Pivotal and Databricks

Databricks was founded by the original team that developed Apache Spark, and is currently the driving force behind the project. When we decided to certify our distribution of the full Apache Spark stack on Pivotal HD and increase our involvement in the Spark community, we could think of no better ally than Databricks with whom to embark on this exciting journey. Furthermore, we’re thrilled to join their effort to maintain compatibility across the growing Spark ecosystem. We’re excited to be making this announcement on the Databricks blog, and look forward to a long and deep relationship with our friends at Databricks.

Getting Started

Try out Pivotal’s Spark bundle on Pivotal HD 2.0.1 by obtaining the Pivotal Spark tarball and quick-start instructions here. The Pivotal HD 2.0.1 release is available for download here. We would love to hear from you and welcome the opportunity to engage in a dialog. Please feel free to drop us a note at [email protected] if you have any questions, or if we can be of any help for your intended use case.

Additionally, make sure you come visit our booth at the upcoming Spark Summit to hear more about using Pivotal HD – now with Spark!
