Spark Streaming

Apache Spark Streaming is the previous generation of Apache Spark’s streaming engine. It is a legacy project that no longer receives updates. Apache Spark now includes a newer, easier-to-use streaming engine called Structured Streaming, which you should use for your streaming applications and pipelines. See Structured Streaming.

What is Spark Streaming?

Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads. It is an extension of the core Spark API that lets data engineers and data scientists process real-time data from sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. The processed data can be pushed out to file systems, databases, and live dashboards.

Its key abstraction is a Discretized Stream, or DStream for short, which represents a stream of data divided into small batches. DStreams are built on RDDs, Spark’s core data abstraction, which allows Spark Streaming to integrate seamlessly with other Spark components such as MLlib and Spark SQL. This sets Spark Streaming apart from systems that either have a processing engine designed only for streaming, or have similar batch and streaming APIs but compile internally to different engines. Spark’s single execution engine and unified programming model for batch and streaming give it several benefits over traditional streaming systems.
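To make the idea of a discretized stream concrete, here is a minimal sketch in plain Python. It does not use Spark or its API. The names discretize, word_count, and batch_interval are hypothetical and chosen only for illustration. The sketch assumes records arrive in time order and cuts them into fixed-width batches, each of which is then processed like an ordinary small batch job.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[float, str]  # (arrival time in seconds, text line)


def discretize(records: Iterable[Record], batch_interval: float) -> Iterator[List[str]]:
    """Group time-ordered records into consecutive fixed-width batches."""
    batch: List[str] = []
    batch_end = batch_interval
    for arrival, line in records:
        # Close every interval that ended before this record arrived,
        # emitting an empty batch for any interval that saw no data.
        while arrival >= batch_end:
            yield batch
            batch = []
            batch_end += batch_interval
        batch.append(line)
    yield batch  # the last, possibly partial, batch


def word_count(lines: Iterable[str]) -> Counter:
    """Ordinary batch logic: count words across a collection of lines."""
    return Counter(word for line in lines for word in line.split())


# Illustrative input only; the timestamps and text are made up.
stream = [
    (0.2, "spark streaming"),
    (0.7, "spark"),
    (1.1, "kafka spark"),
    (3.4, "kinesis"),
]

batches = list(discretize(stream, batch_interval=1.0))
counts = [word_count(batch) for batch in batches]

for index, count in enumerate(counts):
    print(f"batch {index}: {dict(count)}")
```

In this sketch each batch is a plain list. In Spark Streaming each batch of a DStream is an RDD, which is why the rest of the Spark stack can operate on it.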

Four Major Aspects of Spark Streaming

Fast recovery from failures and stragglers
Better load balancing and resource usage
Combining streaming data with static datasets and interactive queries
Native integration with advanced processing libraries (SQL, machine learning, graph processing)

This unification of disparate data processing capabilities is the key reason behind Spark Streaming’s rapid adoption. Developers can use a single framework for all their processing needs instead of maintaining separate systems for batch and streaming.
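The practical effect of a unified model can be sketched in a few lines of plain Python. This is not Spark code. The function revenue_by_region, the USER_REGION table, and the sample events are all hypothetical. The sketch assumes a stream has already been cut into micro-batches, and shows one piece of logic being reused for a historical dataset and for each micro-batch, joined against a static lookup table in both cases.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (user id, amount)

# Static dataset, loaded once (hypothetical lookup table).
USER_REGION: Dict[str, str] = {"u1": "EU", "u2": "US"}


def revenue_by_region(events: List[Event], regions: Dict[str, str]) -> Dict[str, int]:
    """Join events with the static table and sum amounts per region."""
    totals: Dict[str, int] = {}
    for user, amount in events:
        region = regions.get(user, "unknown")  # users missing from the table
        totals[region] = totals.get(region, 0) + amount
    return totals


# Batch use: run the function once over historical data.
historical = [("u1", 10), ("u2", 5), ("u1", 7)]
batch_result = revenue_by_region(historical, USER_REGION)

# Streaming use: run the same function on every micro-batch as it arrives.
micro_batches = [[("u1", 3)], [("u2", 4), ("u3", 1)], []]
stream_results = [revenue_by_region(batch, USER_REGION) for batch in micro_batches]

print(batch_result)
print(stream_results)
```

Because the function only sees a collection of events, it does not need to know whether that collection is a full historical dataset or a one-second slice of a live stream.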
