With Apache Spark 1.1 recently released, we’d like to take this occasion to feature one of the most popular Spark components, Spark Streaming, and highlight who is using it and why.
Apache Spark 1.1 adds several new features to Spark Streaming. In particular, Spark Streaming extends its library of ingestion sources to include Amazon Kinesis, a hosted stream processing engine, and adds high availability for Apache Flume sources. Moreover, Apache Spark 1.1 introduces streaming linear regression, the first of a set of online machine learning algorithms.
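To give a feel for what an online algorithm such as streaming linear regression does, here is a minimal plain-Python sketch. It is not MLlib's implementation: the function names, the synthetic data (y = 2x + 1), and the step size are all assumptions chosen for illustration. The point is only that the model is refined one small batch at a time instead of being retrained on the full data set.

```python
def update_weights(weights, batch, step_size):
    """Apply one gradient-descent step on the squared error of a single batch."""
    gradient = [0.0] * len(weights)
    for features, label in batch:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - label
        for i, x in enumerate(features):
            gradient[i] += error * x
    # Average the gradient over the batch before stepping.
    return [w - step_size * g / len(batch) for w, g in zip(weights, gradient)]


def make_batch(start):
    """Hypothetical data source: four points on the line y = 2x + 1.

    Each feature vector is [x, 1.0]; the constant 1.0 lets the model learn the intercept.
    """
    return [([float(x), 1.0], 2.0 * x + 1.0) for x in range(start, start + 4)]


weights = [0.0, 0.0]
for step in range(2000):
    # Each iteration stands in for one newly arrived batch of the stream.
    weights = update_weights(weights, make_batch(step % 3), step_size=0.05)

print(weights)  # approaches [2.0, 1.0], the slope and intercept of the synthetic line
```

Because the weights carry over from one batch to the next, the model can keep serving predictions while it continues to learn from new data.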
Many organizations have evolved from exploratory, discovery use cases of big data to use cases that require reasoning on data as it arrives in order to make decisions in real time. Spark Streaming enables this category of high-value use cases, providing a system for processing fast and large streams of data in real time.
What is it?
Spark Streaming is an extension of the core Spark API that enables high-throughput, reliable processing of live data streams. Spark Streaming ingests data from many sources, including Amazon Kinesis, Kafka, Flume, Twitter and file systems such as S3 and HDFS. Users can express sophisticated algorithms easily using high-level functions to process the data streams. The core innovation behind Spark Streaming is to treat streaming computations as a series of deterministic micro-batch computations on small time intervals, executed using Spark's distributed data processing framework. Micro-batching unifies the programming model of streaming with that of batch use cases and enables strong fault recovery guarantees while retaining high performance. The processed data can then be written to file systems (including HDFS), databases (including HBase), or live dashboards.
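The micro-batch idea can be illustrated with a small plain-Python sketch. This is a conceptual model under stated assumptions, not Spark Streaming's API or internals: the helper names (`micro_batches`, `run_stream`, `word_count`) and the sample events are hypothetical. It groups timestamped events into fixed time intervals and applies the same deterministic batch function to each interval.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width time intervals."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp // batch_interval)].append(value)
    # Emit (interval start time, values) pairs in time order.
    return [(index * batch_interval, buckets[index]) for index in sorted(buckets)]


def word_count(batch):
    """An ordinary batch computation: count occurrences of each word."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts


def run_stream(events, batch_interval, batch_fn):
    """Process a stream by running the same batch function on every interval."""
    return [(start, batch_fn(values))
            for start, values in micro_batches(events, batch_interval)]


# Hypothetical events: (arrival time in seconds, word).
events = [(0.2, "spark"), (0.7, "kafka"), (1.1, "spark"), (1.4, "spark"), (2.9, "flume")]
results = run_stream(events, 1.0, word_count)
for start, counts in results:
    print(start, counts)
```

Since each interval's result depends only on that interval's input, a lost result can be recomputed from the same input and will come out identical, which is the intuition behind the fault recovery guarantees mentioned above. The batch function is also plain batch code, which is why the same logic can serve both streaming and batch jobs.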
Where is it being used?
Spark Streaming has seen significant uptake in the past year as enterprises increasingly use it as part of Spark deployments. The Databricks team is aware of more than 40 organizations that have deployed Spark Streaming in production.
Just as impressive is the breadth of industries across which Spark Streaming is being used. For instance:
We’ve seen Spark Streaming benefit many parts of an organization, as the following examples illustrate:
Why is it being used?
The reasons enterprises give for adopting (and in many cases transitioning to) Spark Streaming often start with the advantages that Spark itself brings. All the strengths of Spark’s unified programming model apply to Spark Streaming, which is particularly relevant for real-time analytics that combine historical data with fresh data:
We have also learned from the community that the high throughput that Spark Streaming provides is just as important as latency. In fact, latency of a few hundred milliseconds is sufficient for the vast majority of streaming use cases. Rare exceptions include algorithmic trading.
One capability that allows Spark Streaming to be deployed in such a wide variety of situations is the choice of three resource managers: it integrates fully with YARN and Mesos, and it can also rely on Spark’s easy-to-use standalone resource manager. Moreover, Spark and Spark Streaming are already supported by leading vendors such as Cloudera, MapR and DataStax. We expect other vendors will include and support Spark in their Hadoop distributions in the near future.
Please stay tuned for future posts on Spark Streaming technical design patterns and practical use cases.