Today, Cloudera announced that it will distribute and support Apache Spark. We are very excited about this announcement, and what it brings to the Spark platform and the open source community. So what does this announcement mean for Spark?
First, it validates the maturity of the Spark platform. Started as a research project at UC Berkeley in 2009, Spark is the first general-purpose cluster computing engine that can run sophisticated computations at memory speeds on Hadoop clusters. Spark started with the goal of providing efficient support for iterative algorithms (such as machine learning) and interactive queries, two workloads not well supported by MapReduce. Since then, Spark has grown to support other applications such as streaming, and has gained rapid industry adoption. Today, Spark is used in production by numerous companies, and it is backed by an ever-growing open source community with over 90 contributors from 25 companies.
Second, it will make the Spark platform available to a wide range of enterprise customers, both in the US and internationally. By being distributed in conjunction with Cloudera’s CDH, Spark will enjoy the same enterprise-grade support as the other components in Cloudera’s stack. Databricks is fully committed to working with Cloudera to guarantee that its customers will have the best possible support. Furthermore, we look forward to this partnership enabling new categories of exciting applications and addressing unique usage scenarios.
Third, this partnership underlines and strengthens the integration of Spark into the Hadoop ecosystem. Spark can read Hadoop files, share data with other Hadoop frameworks, and support existing Hadoop workloads, including Hive queries. This integration benefits not only Spark but the Hadoop ecosystem as a whole: because Spark can run on top of Hadoop YARN, it brings new capabilities to existing Hadoop clusters. This could give Hadoop users the opportunity to run jobs up to 100x faster than MapReduce, while writing 2-5x less code. For example, a data scientist could leverage Spark’s simple yet powerful API to rapidly develop machine learning algorithms, and then run them at memory speeds on her Hadoop data.
Finally, we want to reiterate our full commitment to open source. The success Spark has enjoyed thus far has only been possible because of a vibrant open source community that has contributed a continuous stream of new functionality and bug fixes. We believe this partnership will ignite a new wave of growth in our community and accelerate the development of the Apache Spark platform to support an ever-growing number of customers.
We have no doubt that we are just at the beginning of a journey to give users the tools to solve tomorrow’s big data challenges. The next stop in this journey is the Spark Summit. Sponsored by leading big data companies and Spark users, including Databricks and Cloudera, this is the first conference that will bring together the Spark community. Come and join us on this journey!