Today, Cloudera announced that it will distribute and support Apache Spark. We are very excited about this announcement, and what it brings to the Spark platform and the open source community. So what does this announcement mean for Spark?

First, it validates the maturity of the Spark platform. Started as a research project at UC Berkeley in 2009, Spark is the first general-purpose cluster computing engine that can run sophisticated computations at memory speeds on Hadoop clusters. Spark started with the goal of providing efficient support for iterative algorithms (such as machine learning) and interactive queries, workloads not well supported by MapReduce. Since then, Spark has grown to support other applications such as streaming, and has gained rapid industry adoption. Today, Spark is used in production by numerous companies, and it is backed by an ever-growing open source community with over 90 contributors from 25 companies.

Second, it will make the Spark platform available to a wide range of enterprise customers, both in the US and internationally. By being distributed in conjunction with Cloudera’s CDH, Spark will enjoy the same enterprise-grade support as the other components in Cloudera’s stack. Databricks is fully committed to working with Cloudera to guarantee that its customers will have the best possible support. Furthermore, we look forward to this partnership enabling new categories of exciting applications and addressing unique usage scenarios.

Third, this partnership underlines and strengthens the integration of Spark into the Hadoop ecosystem. Spark can read Hadoop files, share data with other Hadoop frameworks, and support existing Hadoop workloads, including Hive queries. This integration benefits not only Spark but the Hadoop ecosystem as a whole, because Spark's ability to run on top of Hadoop YARN brings new capabilities to existing clusters. This could give Hadoop users the opportunity to run jobs up to 100x faster than MapReduce, while writing 2-5x less code. For example, a data scientist could leverage Spark’s simple yet powerful API to rapidly develop machine learning algorithms, and then run them at memory speeds on her Hadoop data.

Finally, we want to reiterate our full commitment to open source. The success Spark has enjoyed thus far has only been possible because of a vibrant open source community, which has contributed a continuous stream of new functionality and bug fixes. We believe this partnership will ignite a new wave of growth in our community and accelerate the development of the Apache Spark platform to support an ever-growing number of customers.

We have no doubt that we are just at the beginning of a journey to give users the tools to solve tomorrow’s big data challenges. The next stop in this journey is the Spark Summit. Sponsored by leading big data companies and Spark users, including Databricks and Cloudera, this is the first conference that will bring together the Spark community. Come and join us on this journey!
