Skip to main content

Apache Spark™

Apache Spark is a lightning-fast unified analytics engine for big data and machine learning. It was originally developed at UC Berkeley in 2009.

largest open source apache spark image

The largest open source project in data processing.

Since its release, Apache Spark, the unified analytics engine, has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open source community in big data, with over 1000 contributors from 250+ organizations.

The team that started the Spark research project at UC Berkeley founded Databricks in 2013.

Apache Spark is 100% open source, hosted at the vendor-independent Apache Software Foundation. At Databricks, we are fully committed to maintaining this open development model. Together with the Spark community, Databricks continues to contribute heavily to the Apache Spark project, through both development and community evangelism.

What is Apache Spark - Benefits of Apache Spark

Speed

Engineered from the bottom-up for performance, Spark can be 100x faster than Hadoop for large scale data processing by exploiting in memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting.

Ease of Use

Spark has easy-to-use APIs for operating on large datasets. This includes a collection of over 100 operators for transforming data and familiar data frame APIs for manipulating semi-structured data.

A Unified Engine

Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.

Try Apache Spark on the Databricks cloud for free

The Databricks Unified Analytics Platform offers 5x performance over open source Spark, collaborative notebooks, integrated workflows, and enterprise security — all in a fully managed cloud platform.

The open source Apache Spark project can be downloaded here