2014 has been a year of tremendous growth for Apache Spark. It became the most active open source project in the Big Data ecosystem with over 400 contributors, and was adopted by many platform vendors – including all of the major Hadoop distributors. Through our ecosystem of products, partners, and training at Databricks, we also saw over 200 enterprises deploying Spark in production.
To help Spark achieve this growth, Databricks has worked broadly throughout the project to improve functionality and ease of use. Indeed, while the community has grown a lot, about 75% of the code added to Spark last year came from Databricks. In this post, we would like to highlight some of the additions we made to Spark in 2014, and provide a preview of our priorities in 2015.
In general, our approach to developing Spark is two-fold: improving usability and performance for the core engine, and expanding the functionality of libraries on top, such as streaming, SQL, and machine learning. Because all these libraries use the same core engine, they benefit from the same improvements in deployability, performance, etc.
Major Spark Additions in 2014
On the core engine, here are the major improvements we’ve made in 2014:
- Language support: A major requirement for many enterprises was to make Spark available in the languages their users were most familiar with, such as Java and Python. Databricks led the work to integrate Spark with Java 8, offering much simpler syntax to Java users, and led major additions to the Python API, including performance improvements and the Python interfaces to MLlib, Spark Streaming and Spark SQL.
- Production management: We helped to add high-availability features to the Spark standalone master (allowing master recovery through ZooKeeper) and to Spark Streaming (storing input reliably, even from unreliable data sources, so that it can be recovered after a failure). We also worked with the community to make Spark scale dynamically on YARN, leading to better resource utilization, and to integrate with Hadoop ecosystem features such as the Hadoop security model.
- Performance and stability: We rewrote Spark’s shuffle and network layers to provide significantly higher performance, and used this work to break the world record in sorting using Spark, beating the previous Hadoop-based record by 30x in per-node performance. More generally, we have worked broadly to make Spark operators run better on disk, allowing great performance at any scale from petabytes to megabytes.
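To make the per-node figure concrete, here is a rough back-of-the-envelope sketch. The cluster sizes, data volumes, and run times below are the publicly reported Daytona GraySort figures as we recall them; they are not stated in this post, so treat them as assumptions, and note that the helper name `per_node_throughput` is purely illustrative.

```python
# Assumed inputs: publicly reported sort benchmark figures, not taken from this post.
spark_run = {"data_tb": 100.0, "minutes": 23.0, "nodes": 206}
hadoop_run = {"data_tb": 102.5, "minutes": 72.0, "nodes": 2100}


def per_node_throughput(run):
    """Return terabytes sorted per minute per node for one benchmark run."""
    return run["data_tb"] / run["minutes"] / run["nodes"]


# Ratio of per-node throughputs: how much more work each Spark node did per minute.
speedup = per_node_throughput(spark_run) / per_node_throughput(hadoop_run)
print(f"Approximate per-node speedup: {speedup:.1f}x")
```

Under these assumptions, the roughly 3x shorter run time and roughly 10x smaller cluster multiply together to give a per-node improvement of about 30x.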
On the libraries side, we’ve also had the fastest growth in Spark’s standard library to date. Databricks contributed the following:
- Spark SQL: We contributed a new module for structured data that makes it much easier to use Spark with data sources like Apache Hive, Parquet and JSON, and provides fast SQL connectivity to BI tools like MicroStrategy, Qlik and Tableau. Through Spark SQL, both developers and analysts can now more easily leverage Spark clusters.
- Machine learning library: Databricks contributed multiple new algorithms and optimizations to MLlib, Spark’s machine learning library, speeding up some tasks by as much as 5x. We also contributed a statistics library and a new, high-level pipeline API to make it easier to write complete machine learning applications.
- Graph processing: We worked with UC Berkeley to add GraphX as the standard graph analytics library in Spark, giving users access to a variety of graph processing algorithms.
- API stability in Spark 1.0: On a more technical but very important level, we worked with the community to define API stability guarantees for Spark 1.x, which means that applications written against Spark today will continue running on future versions. This is a crucial feature for enterprises and developers, as it allows application portability across vendors and into future versions of Spark.
Looking back, it’s a bit hard to imagine that a year ago, Spark didn’t have built-in BI connectivity, rich monitoring, or about half of the higher-level libraries it contains today. Nonetheless, this is the rate at which fast-growing projects move. We’re thrilled to continue working with the community to bring even more great features to Spark.
Even though 2014 has been a great year for Spark, we know that we are only at the beginning of enterprise use of both Spark and big data in general. At Databricks, we’re focused on a handful of major initiatives for Spark in 2015:
- Empowering large-scale data science. In 2015, Spark will expand its focus on data scientists by providing higher-level, powerful APIs for statistical and analytical processing. The SparkR project, which allows use of Spark from R, is quickly coming of age, and work to merge SparkR into Spark is already under way. We’re also introducing a data frame library for use across all of Spark’s language APIs (Java, Scala, Python, and R) and a machine learning pipeline API in MLlib designed to interoperate with data frames. The data frame library makes working with datasets, small or large, approachable for a wide range of users.
- Rich data source integration. The data management ecosystem is home to a variety of data sources and sinks. Our work on a pluggable data source API for Spark SQL will connect Spark to many traditional enterprise data sources in addition to the new wave of big data / NoSQL storage systems. Work is already under way to connect to JDBC, HBase, and DBF files. To showcase data sources and other Spark integrations from the broader community, we have also recently launched Spark Packages, a community index to track third-party libraries available for Spark. Spark Packages has over 30 libraries today; we expect it to grow substantially in 2015.
- Simplifying deployment with Databricks Cloud. Our main goal at Databricks remains to make big data simple. This extends beyond designing the concise, elegant APIs available in Spark to providing a hassle-free runtime environment for our users. With Databricks Cloud, we make it easy for users to get started with Spark and big data within minutes, bypassing the many months of setup traditionally needed for a big data project.
Of course, you can also expect “more of the same”: continued work on performance and capabilities throughout Spark. If you’d like to find out more about the latest Spark use cases and developments, sign up for Spark Summit East in New York City in March. The agenda for the conference was recently posted, and it’s going to be our best community conference yet, with high-quality talks from industries including healthcare, finance and transportation.