Apache Spark is a powerful open-source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open-sourced in 2010.
In subsequent years it has seen rapid adoption, used by enterprises small and large across a wide range of industries. It has quickly become one of the largest open source communities in big data, with over 200 contributors from 50+ organizations.
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.
Ease of Use
Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data from the shell.
In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box. Better yet, users can combine all these capabilities seamlessly in a single workflow.
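To make the operator style concrete, here is a word count expressed with plain Python functional operators. This is a single-machine sketch only: Spark’s actual API applies operators with the same shape (flatMap, map, reduceByKey) to distributed datasets, which is not reproduced here.

```python
from functools import reduce

# Illustrative stand-in for Spark-style word count; the input lines are
# made up for the example.
lines = ["spark makes big data simple", "big data big insights"]

words = [w for line in lines for w in line.split()]           # like flatMap
pairs = list(map(lambda w: (w, 1), words))                    # like map
counts = reduce(                                              # like reduceByKey
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
print(counts["big"])  # 3
```

The same chain of a few high-level operators replaces what would be a full hand-written MapReduce job.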
Spark + Hadoop
What is Hadoop?
Hadoop is the de-facto standard for large scale data processing across nearly every industry and enterprise, with numerous vendors providing Hadoop “distributions” coupled with enterprise-grade support services.
In short, Hadoop scales out computation and storage across cheap commodity servers and allows other applications to run on top of both of these — Spark is one of these applications.
Unlocking your Hadoop data with Spark
Spark runs on top of existing Hadoop clusters to provide enhanced and additional functionality.
Although Hadoop is effective for storing vast amounts of data cheaply, the computation model it ships with, MapReduce, is limited: it can express only simple computations and runs them in a high-latency batch model. Spark provides a more general and powerful alternative to Hadoop’s MapReduce, offering rich functionality such as stream processing, machine learning, and graph computations.
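One concrete difference is how iterative workloads touch the data. The toy Python sketch below (an illustration, not either system’s real API) counts how often the input is loaded: a MapReduce-style sequence of jobs re-reads the input every pass, while a Spark-style program loads it once into memory and iterates over the cached copy.

```python
# Counter so the sketch can observe how many times the "dataset" is read.
loads = {"n": 0}

def load_dataset():
    # Stands in for reading input from HDFS.
    loads["n"] += 1
    return list(range(1000))

# MapReduce-style: each iteration is a fresh job that re-reads the input.
for _ in range(5):
    data = load_dataset()
    total = sum(data)
reloads = loads["n"]          # input was read 5 times

# Spark-style: load once, keep in memory (in the spirit of caching an
# RDD), then iterate over the in-memory copy.
loads["n"] = 0
cached = load_dataset()
for _ in range(5):
    total = sum(cached)

print(reloads, loads["n"])  # 5 1
```

For algorithms that make many passes over the same data, avoiding the repeated reload is where much of Spark’s speedup comes from.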
Built on Hadoop Storage: Spark is 100% compatible with Hadoop’s Distributed File System (HDFS), HBase, and any Hadoop storage system, so your existing data is immediately usable in Spark.
Ease of Deployment: Spark provides out-of-the-box support for deploying within an existing Hadoop v1 cluster (with SIMR – Spark-Inside-MapReduce) or a Hadoop v2 YARN cluster. Additionally, Spark has built-in scripts for launching on Amazon EC2.
General execution: Spark Core
Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.
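A key part of that execution model is laziness: transformations only build up a plan, and work happens when an action asks for a result. The minimal sketch below (a made-up Pipeline class, not Spark’s real implementation) captures the idea in plain Python.

```python
# Hypothetical illustration of lazy execution: transformations record
# steps; only the collect() action runs the chain.
class Pipeline:
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []

    # Transformations: return a new pipeline, perform no work yet.
    def map(self, f):
        return Pipeline(self.data, self.steps + [("map", f)])

    def filter(self, p):
        return Pipeline(self.data, self.steps + [("filter", p)])

    # Action: only here does the recorded chain actually execute.
    def collect(self):
        out = list(self.data)
        for kind, f in self.steps:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

result = (Pipeline(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```

Deferring execution until an action lets an engine see the whole chain at once and plan how to run it, which is how Spark generalizes beyond the fixed map-then-reduce shape.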
Structured Data: Spark SQL
Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is an engine for Hive data that enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data. It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning).
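The appeal is mixing declarative SQL with programmatic code in one workflow. As a stand-in sketch only, the example below uses Python’s built-in sqlite3 module (not Spark SQL, whose API is not shown) to run an aggregate query and then hand the result to ordinary code, the way one might feed SQL output into a machine learning step; the table and data are invented for illustration.

```python
import sqlite3

# Hypothetical page-visit data queried with SQL, then post-processed in code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("home", 120), ("docs", 90), ("docs", 50), ("blog", 10)],
)

rows = conn.execute(
    "SELECT page, SUM(hits) FROM visits GROUP BY page ORDER BY SUM(hits) DESC"
).fetchall()

# The SQL result is now ordinary data for the rest of the program.
top_page, top_hits = rows[0]
print(top_page, top_hits)  # docs 140
```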
Streaming analytics: Spark Streaming
Many applications need the ability to process and analyze not only batch data, but also streams of new data in real-time. Running on top of Spark, Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics. It readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.
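Spark Streaming processes live data in small batches, so the same batch-style logic applies to streaming input. The plain-Python sketch below (not the real DStream API; the event stream is invented) shows the micro-batch idea: chop the stream into fixed-size batches and update running state with each one.

```python
from collections import Counter

# Hypothetical stream of log events, processed in micro-batches.
stream = ["error", "ok", "ok", "error", "ok", "error", "ok", "ok"]
batch_size = 3
running = Counter()

for i in range(0, len(stream), batch_size):
    batch = stream[i:i + batch_size]   # one micro-batch of events
    running.update(batch)              # same logic a batch job would use

print(running["ok"], running["error"])  # 5 3
```

Because each micro-batch is handled with ordinary batch logic, streaming and historical analyses can share code, which is the combination of streaming and batch the paragraph above describes.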
Machine Learning: MLlib
Machine learning has quickly emerged as a critical piece in mining Big Data for actionable insights. Built on top of Spark, MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (up to 100x faster than MapReduce). The library is usable in Java, Scala, and Python as part of Spark applications, so that you can include it in complete workflows.
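The “multiple iterations” point is worth making concrete. The tiny gradient-descent loop below is illustrative only (plain Python, not MLlib’s API): each iteration is a full pass over the training data, and the model improves with every pass. It is exactly this repeated-pass access pattern that in-memory caching accelerates.

```python
# Toy training set generated by y = 2x, so the fitted weight should
# approach 2; data and learning rate are made up for the example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w, lr = 0.0, 0.01
for _ in range(200):                  # each iteration: one pass over the data
    # Gradient of mean squared error for the model y ≈ w * x.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 2))  # 2.0
```

Run at scale, an algorithm like this makes dozens of passes over the same dataset, so keeping that dataset in cluster memory rather than re-reading it from disk each pass is what enables the speedups claimed above.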
One of the benefits of Spark’s vibrant open-source community is continued innovation that extends Spark’s capabilities; many of these projects originated in UC Berkeley’s AMPLab. Here is a sampling of ongoing projects in the community that are still in alpha:
BlinkDB: An approximate query engine for interactive SQL queries in Shark that allows users to trade off query accuracy for response time. This enables interactive queries over massive data by using data samples and presenting results annotated with meaningful error bars.
GraphX: A graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph structured data at scale.
SparkR: A package for the R statistical language that enables R-users to leverage Spark functionality interactively from within the R shell.
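The sampling-with-error-bars idea behind BlinkDB, described above, can be sketched in a few lines of plain Python (an illustration of the concept, not BlinkDB’s implementation): answer an aggregate query from a small random sample, and report a standard-error-based confidence interval instead of an exact figure.

```python
import random
import statistics

random.seed(42)
# Hypothetical full table: 100,000 numeric values centered near 100.
population = [random.gauss(100, 15) for _ in range(100_000)]

sample = random.sample(population, 1_000)                 # 1% sample
estimate = statistics.mean(sample)                        # approximate answer
stderr = statistics.stdev(sample) / len(sample) ** 0.5    # basis of the error bar

print(f"mean ~= {estimate:.1f} +/- {1.96 * stderr:.1f}")
```

Scanning 1% of the data makes the query roughly two orders of magnitude cheaper, and the reported interval tells the user how much accuracy was traded away.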
The 100% open source Apache Spark project can be downloaded from Apache. The site also contains installation instructions, video tutorials, and documentation to get you started.