To learn more about Apache Spark, attend Spark Summit East in New York in Feb 2016.
2015 has been a phenomenal year of growth for both Databricks and the Apache Spark project. In June, we launched general availability (GA) of our cloud platform, the first end-to-end enterprise data platform based on Spark. At the same time, we have continued our efforts in training Spark developers and, of course, in developing Spark itself. In this post, we want to share some updates about each of these efforts, and let you know what we’ve been up to in 2015:
- In the time since GA, Databricks has been adopted by over 200 paying customers, making it the Spark platform with the largest number of customers of any enterprise vendor.
- Databricks has trained over 20,000 Spark developers in 2015, again more than any other company.
- In 2015, we continued to be the largest contributor to the Apache Spark project, with 10x more code contributions than any other company.
Simplifying Enterprise Data Applications with Databricks
When our team first designed Spark at UC Berkeley, we wanted to make writing big data applications easier. However, we realized that much more was needed to make big data simple for an organization: big data projects spend most of their effort managing infrastructure, loading data, and keeping production jobs running. This is why we developed Databricks, an end-to-end, managed platform based on Spark. With Databricks, organizations can immediately start working on their data problems, in an environment accessible to data scientists, engineers, and business users alike.
Since the GA of this platform in June, Databricks has been adopted by over 200 paying customers, with applications ranging from data warehousing and reporting to real-time machine learning. We have also learned a lot from our customers’ use of the platform. In particular, we saw three interesting trends:
- Ease of adoption. Although big data has traditionally required specialized expertise, 40% of our customers indicated that they had never used Spark or Hadoop before deploying Databricks, yet they were still able to become proficient with the platform. Spark’s approachability, combined with automatic operations, made Databricks easy to adopt for users familiar only with “small data”.
- Democratization of data. While most deployments begin with a small data science team, many have quickly grown to 50+ users, because the platform makes data access easy for non-experts. With Databricks, data scientists can publish notebooks or dashboards that business users can use to explore the data curated by the team. This saves time for both parties and lets more users work with the company’s data.
- Enterprise security enhancements. We learned from our customers that the ease of sharing and collaboration in notebooks brings enterprise security concerns to the forefront. With full support for role-based access control, auditing, and encryption on the wire and on disk, sharing data in an organization is secure and easy.
To give a sense of what organizations have been able to do with Databricks, some of our customer highlights in 2015 included:
- With Databricks, Elsevier Labs – the advanced R&D group within Elsevier, a global provider of scientific information – completed advanced analytics projects faster (weeks to days) and broadened access to data (15 people contributing instead of two or three specialists).
- MyFitnessPal’s legacy data pipeline was slow, did not scale, and lacked flexibility. Databricks helped them solve all of these challenges with our automatically managed Spark clusters, interactive workspace, and a production job scheduler to easily transition from development to production.
- Celtra expanded the number of people able to work with their data by a factor of four, allowing them to increase the amount of ad-hoc analysis done six-fold. View the webinar “How Celtra Optimizes its Advertising Platform with Databricks” to see how users across the organization use this data.
Training Data Scientists and Engineers
As developers at heart, we have also made it a key part of our mission to empower other professionals to tackle big data problems. We are happy to note that in 2015 we trained over 20,000 developers on Spark, more than any other company.
Spark education was top of mind for us in 2015 with the launch of several key programs:
- The Databricks private training program provides on-site, instructor-led training. Our trainers are certified by our core Spark development team, which also contributes directly to our courseware.
- We partnered with UC Berkeley and UCLA to launch two massive open online courses (MOOCs) on Spark. The first course, Introduction to Big Data with Apache Spark, teaches students about Spark and data analysis. The second course, Scalable Machine Learning, introduces students to machine learning with Spark. Both courses are freely available on the edX platform. Over 125,000 students registered for the first delivery of the two classes, with 24% engaged and 12% passing Introduction to Big Data with Apache Spark (this is 2.5 times greater than the average MOOC completion rate).
- The academic partners program gives educators free access to Databricks for research and classroom use. We worked with universities across the world, including Stanford, UC Berkeley, and UIUC, supporting Spark education for over 1,200 students. In addition to supporting academic research, we have been publishing our own work at top research venues such as SIGMOD and VLDB.
Spark Community Leadership
Our 2015 Spark Survey results validate that our work to make Spark data processing easy and accessible is resonating with Spark users across many industries. Key findings from the survey included:
- Spark is growing beyond Hadoop: Only 40% of Spark users deploy it inside Hadoop, while 48% deploy it standalone and 11% on Apache Mesos. Whereas most Spark deployments had traditionally been in Hadoop, we now see cloud deployments (51%) and data sources other than Hadoop (e.g. Cassandra) becoming increasingly popular.
- Streaming and advanced analytics uses rising: Spark is being used for an increasingly diverse set of applications, particularly machine learning, streaming, and graph analytics.
- Increasing access to big data: Spark is breaking down technology barriers between data scientists, analysts, and engineers, who are working collaboratively to solve problems. In particular, we see the rapid growth of Spark use in languages like SQL and Python and through BI tools.
- Spark’s most popular use cases came to light: the most common use cases were business intelligence (68%), data warehousing (52% of organizations), recommendation engines (48%), log processing (40%), and fraud detection and security (29%).
As the Spark community expands at an amazing pace (with 650 contributors in 2015 alone), Databricks has continued to be the largest contributor to the Apache Spark project, providing 10x more code than any other company. We consider the success of Spark one of our key missions, and to this end we have contributed to all areas of Spark in 2015. Some of our major contributions this year were:
- DataFrames, an easy-to-use and efficient API for working with structured data, similar to the APIs of “small data” tools like R.
- Project Tungsten, the largest update to date of Spark’s internals to provide more efficient execution on modern hardware.
- Machine learning pipelines, an easy-to-use API for complete machine learning workflows.
- Multiple features in SparkR, the new R language interface to Spark.
- New advanced analytics algorithms, data sources, and monitoring tools.
For a deep dive on the major additions in 2015, please read Reynold Xin’s blog post, Spark 2015 Year In Review.
But it’s not all about the code: nurturing the Spark community is also about bringing users together. To this end, we gathered 4,000 attendees across three Spark Summits, taking the conference to Europe and New York for the first time. We have also contributed to dozens of local meetup groups with our Meetup-in-the-box initiative. We plan to expand both of these initiatives in 2016.
Finally, 2015 was also a significant year for our partners. We are happy to see IBM, Hortonworks, Intel, Cloudera, and MapR, just to name a few, investing significantly in Spark. We look forward to continuing our collaboration with them in 2016, to build a stronger Spark community.
While 2015 was exciting, we believe it is still only the beginning for both Databricks and Apache Spark. Our overall mission is to make big data simple, allowing every enterprise to gain value from its data. Our experience with Databricks customers so far shows that this is indeed possible: with a fully managed end-to-end platform, customers are completing projects in a fraction of the time it took with previous tools, while making their data accessible to more users in their organization, well beyond the “big data experts”. In 2016, we will continue to work with our customers, partners, and the Spark community to make extracting value from data even easier.