When we announced that the original team behind Apache Spark is starting a company around the project, we got a lot of excited questions. What areas will the company focus on, and what will it mean for the open source project? Today, in our first blog post at Databricks, we’re happy to share some of our goals, and say a little about what we’re doing next with Spark.
To start with, our mission at Databricks is simple: we want to build the very best computing platform for extracting value from data. Big data is a tremendous opportunity that is still largely untapped, and we’ve been working for the past six years to transform what can be done with it. Going forward, we are fully committed to building out the open source Apache Spark platform to achieve this goal.
How We Think about Big Data: Speed and Sophistication
In the past few years, open source technologies like Hadoop have made it dramatically easier to store large volumes of data. This capability is transforming a wide range of industries, from brick-and-mortar enterprises to the web. Over time, though, simply collecting big data will not be enough to maintain a competitive edge; the question will be what you can do with this data.
We believe that two axes will determine how well an organization can draw value from data: speed and sophistication. By speed, we mean not only the speed at which we compute and return answers, but also the speed of development: how quickly can users take a new idea from the drawing board to a production application? By sophistication, we mean what type of analysis can be done. Today’s big data systems do not support the sophisticated analysis functions found in tools like R and MATLAB, which limits the kinds of questions they can answer. Enabling these types of analyses would greatly increase their value.
Through the Apache Spark project, we’ve been working to address both axes in a way that works seamlessly with the Hadoop stack. Released in 2010, Spark remains the only widely deployed engine for Hadoop to support in-memory computing and general execution graphs, and, with APIs in Scala, Java, and Python, it is the easiest way to program applications on Hadoop data. Released shortly after, Shark was the first system to speed up Hive queries by up to 100x, and it is the only one of the new “SQL on Hadoop” engines that retains full Hive compatibility (by building directly on Hive) and supports in-memory computation. Looking forward, libraries like MLlib and GraphX are making it easy to call sophisticated machine learning and graph algorithms from Spark while running them at memory speeds. These tools have already given numerous organizations the ability to do faster and richer data analysis, and we hope to bring them to hundreds more.
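To make the programming model concrete, here is a minimal sketch in Scala against the Spark 0.8 API (the HDFS path, host name, and application name are hypothetical). It loads a dataset into memory once with cache(), then runs two queries over it at memory speed:

```scala
import org.apache.spark.SparkContext

object LogQueries {
  def main(args: Array[String]) {
    // Connect to a local Spark instance; a real cluster URL would go here.
    val sc = new SparkContext("local", "LogQueries")

    // Load a (hypothetical) log file and pin it in memory across queries.
    val logs = sc.textFile("hdfs://namenode:9000/logs/app.log").cache()

    // The first action reads and caches the data; later queries run at memory speed.
    val errors = logs.filter(_.contains("ERROR")).count()
    val warnings = logs.filter(_.contains("WARN")).count()

    println("errors: " + errors + ", warnings: " + warnings)
    sc.stop()
  }
}
```

The same few lines translate almost directly to the Java and Python APIs, which is exactly the kind of speed-of-development win we described above.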
What We’re Working On
At Databricks, we’re committed to bringing Spark to an ever-wider set of users and to greatly increasing its capabilities. Through both the recent Apache Spark 0.8 release and our ongoing work, we’ve been building out quite a few new features. Expect to see a focus on the following areas:
- Deployment: We want to make Spark effortless to deploy for any user, whether or not they have an existing Hadoop cluster. Apache Spark 0.8 made significant strides in this respect, with improved support for Mesos, EC2, and Hadoop YARN.
- High availability: One exciting feature that we’ve already merged into Apache Spark 0.8.1 is high availability for the master node (see the configuration sketch after this list). More broadly, because many users run Spark in availability-critical settings (e.g. streaming or user-facing applications), we want to make high availability easier to achieve throughout the stack.
- New features: Beyond these top-level goals, we have an exciting roadmap of features coming soon, including Scala 2.10 support, new machine learning algorithms, graph computation, and updates to Spark Streaming.
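As a taste of the high availability work mentioned above, here is a rough configuration sketch for the standalone master, assuming ZooKeeper-based recovery as described in the Spark standalone documentation (the ZooKeeper host names are hypothetical):

```
# In conf/spark-env.sh on each master node (hypothetical ZooKeeper hosts)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"
```

With this in place, you can start multiple masters registered against the same ZooKeeper ensemble; if the active master fails, a standby takes over leadership and recovers the running applications and workers.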
Most importantly, we believe that, despite the progress of the past few years, big data processing is still in its infancy, and there is tremendous room for tools that are faster, easier to use, and capable of richer computation. We hope you will join us in defining the next generation of big data systems and in unlocking the speed and sophistication that we believe is possible for big data.