Earlier this month we held the first Spark Summit, a conference to bring the Apache Spark community together. We are excited to share some statistics and highlights from the event.

  • 450 participants from over 180 companies attended
  • Participants came from 13 countries
  • Spark training was sold out at 200 participants from 80 companies
  • 20 organizations sponsored the event, including all major Hadoop platform vendors
  • 20 different organizations gave talks

The Summit included keynotes from Databricks, the UC Berkeley AMPLab, and Yahoo, as well as presentations from 18 other companies including Amazon, Red Hat, and Adobe. Talks covered a wide range of topics, including specialized applications such as mapping and manipulating the brain, product launches, and research projects about future directions for the project. We are very excited to see Spark and related projects come such a long way, from research prototypes originally developed in the AMPLab at Berkeley to production use by startups and large companies alike.

The State of Spark, and Where We’re Going Next

In the first keynote of the day, Matei Zaharia, who started the Spark project and is now CTO at Databricks, gave a summary of recent growth in the project [1], highlighting key contributions from across the community. In particular, Spark recently reached 100 contributors, making it the second-largest open source development community in the Big Data space after Hadoop, and it has also overtaken Hadoop in the past 6 months. Matei also previewed the Spark development roadmap, including features under development for the upcoming 0.8.1 and 0.9 releases, such as high availability for the Spark Master in standalone mode, external hashing and sorting, and support for Scala 2.10. Finally, Matei discussed features of Spark that differentiate it from other projects, such as the focus on unification of diverse programming models.

Spark and Hadoop

It was exciting to hear from Eric Baldeschwieler, whose background includes leading the Yahoo team that took Hadoop from a prototype project to what it is today, as well as serving as both CEO and CTO of Hortonworks. In his keynote, Eric presented his view of how Spark is on track to become the “lingua franca” for Big Data. He also talked about how Spark updates Hadoop with important features such as effective utilization of RAM, low-latency queries, and streaming ingest. Finally, he discussed how the Spark and Hadoop projects relate to each other now and going forward.

Spark Training Day

On the sold-out second day, 200 attendees heard 4 talks on using, deploying, and administering Spark, and participated in hands-on training led by the UC Berkeley team that started the research project that later became Apache Spark. Amazon, a sponsor of the Summit, donated EC2 resources so that each participant could have their own 6-node Spark cluster to practice using Spark, Spark Streaming, and Shark on real Wikipedia and Twitter data.

The final talk of the training day was given by core Spark developer and Databricks cofounder Patrick Wendell. Patrick’s talk, “Administering Spark”, was prepared in response to requests for more advanced technical material as part of the training. In it he covered the core software components of the project, the different resource managers that Spark runs on, what type of hardware to use when running Spark, how to link against Spark when writing applications, how to monitor Spark clusters running in production, and more.

Footnotes

  1. For more on this, check out our recent blog post about the growth of the Spark community.