Faster Stateful Stream Processing in Apache Spark Streaming

February 1, 2016 | by Tathagata Das and Shixiong Zhu | Posted in Engineering Blog

To learn the latest developments in Apache Spark, register today to join the Spark community at Spark Summit in New York City!


Many complex stream processing pipelines must maintain state across a period of time. For example, if you are interested in understanding user behavior on your website in real-time, you will have to maintain information about each “user session” on the website as a persistent state and continuously update this state based on the user’s actions. Such stateful streaming computations could be implemented in Spark Streaming using its updateStateByKey operation.
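For contrast, the per-key update function that updateStateByKey expects receives all new values for a key in the current batch plus the key's previous state, and returns the new state. A minimal, pure-Scala sketch (a hypothetical running count per key, for illustration only):

```scala
// Shape of the per-key update function expected by updateStateByKey:
// all new values for a key in this batch, plus the previous state,
// returning the new state (None removes the key from the state).
def updateRunningCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(state.getOrElse(0) + newValues.sum)
```

Note that updateStateByKey invokes this function for every key in the state on every batch, whether or not new data arrived for that key.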

In Apache Spark 1.6, we have dramatically improved our support for stateful stream processing with a new API, mapWithState. The new API has built-in support for the common patterns that previously required hand-coding and optimization when using updateStateByKey (e.g., session timeouts). As a result, mapWithState can provide up to 10x higher performance than updateStateByKey. In this blog post, we explain mapWithState in more detail and give a sneak peek of what is coming in the next few releases.

Stateful Stream Processing with mapWithState

One of the most powerful features of Spark Streaming is the simple API for stateful stream processing and the associated native, fault-tolerant, state management. Developers only have to specify the structure of the state and the logic to update it, and Spark Streaming takes care of distributing the state in the cluster, managing it, transparently recovering from failures, and giving end-to-end fault-tolerance guarantees. While the existing DStream operation updateStateByKey allows users to perform such stateful computations, with the new mapWithState operation we have made it easier for users to express their logic and get up to 10x higher performance. Let’s illustrate these advantages using an example.

Let’s say we want to learn about user behavior on a website in real time by monitoring the history of their actions. For each user, we need to maintain a history of user actions. Furthermore, based on this history, we want to output the user’s behavior model to a downstream data store.

To build this application with Spark Streaming, we have to get a stream of user actions as input (say, from Kafka or Kinesis), transform it using mapWithState to generate the stream of user models, and then push them to the data store.

[Figure 1: Maintaining user sessions with stateful stream processing in Spark Streaming]

Conceptually, mapWithState is an operator that takes a user action and the current user session as input. Based on the input action, the operator can choose to update the user session and then output the updated user model for downstream operations. The developer specifies this update function when defining the mapWithState operation.

Turning this into code, we start by defining the state data structure and the function to update the state.

def stateUpdateFunction(
    userId: UserId,
    newData: UserAction,
    stateData: State[UserSession]): UserModel = {

  val currentSession = stateData.get()  // Get current session data
  val updatedSession = …                // Compute updated session using newData
  stateData.update(updatedSession)      // Update session data

  val userModel = …                     // Compute model using updatedSession
  userModel                             // Return model to be sent downstream
}

Then, we define the mapWithState operation on a DStream of user actions. This is done by creating a StateSpec object that contains the full specification of the operation.

// Stream of user actions, keyed by the user ID
val userActions = … // stream of key-value tuples of (UserId, UserAction)

// Stream of data to commit
val userModels = userActions.mapWithState(StateSpec.function(stateUpdateFunction))
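Outside of Spark, the per-key behavior of such an update function can be sketched as a pure function applied action by action (with hypothetical, simplified UserSession and UserModel types standing in for the ones elided above):

```scala
// Hypothetical simplifications of the types in the example above.
case class UserSession(actions: List[String])
case class UserModel(actionCount: Int)

// Pure analogue of stateUpdateFunction: take the current session (if any)
// and a new action; return the updated session and the model to emit.
def step(session: Option[UserSession], action: String): (UserSession, UserModel) = {
  val updated = UserSession(action :: session.map(_.actions).getOrElse(Nil))
  (updated, UserModel(updated.actions.size))
}
```

Folding step over a key's actions reproduces, per key, what Spark Streaming does across batches, with distribution and fault tolerance handled for you.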

New Features and Performance Improvements with mapWithState

Now that we have seen an example of its use, let’s dive into the specific advantages of this new API.

Native support for session timeouts

Many session-based applications require timeouts, where a session should be closed if it has not received new data for a while (e.g., the user left the session without explicitly logging out). Instead of hand-coding it in updateStateByKey, developers can directly set timeouts in mapWithState.

userActions.mapWithState(StateSpec.function(stateUpdateFunction).timeout(Minutes(10)))

Besides timeouts, developers can also set partitioning schemes and initial state information for bootstrapping.
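Spark manages timeouts internally, but the eviction semantics can be illustrated with a small pure-Scala model (a hypothetical helper, not Spark's implementation): a key whose last update is older than the timeout is dropped from the state.

```scala
import scala.concurrent.duration._

// Toy eviction model (hypothetical, not Spark's implementation): before a
// batch is processed, drop every session idle for longer than the timeout.
// State maps each key to its value plus a last-update timestamp in millis.
def evictIdle[K, V](state: Map[K, (V, Long)],
                    now: Long,
                    timeout: Duration): Map[K, (V, Long)] =
  state.filter { case (_, (_, lastSeen)) => now - lastSeen <= timeout.toMillis }
```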

Arbitrary data can be sent downstream

Unlike updateStateByKey, arbitrary data can be sent downstream from the state update function, as already illustrated in the example above (i.e. the user model returned from the user session state). Furthermore, snapshots of the up-to-date state (i.e. user sessions) can also be accessed.

val userSessionSnapshots = userActions.mapWithState(stateSpec).stateSnapshots()

Here userSessionSnapshots is a DStream in which each RDD is a snapshot of the updated sessions after each batch of data is processed. This DStream is equivalent to the DStream returned by updateStateByKey.

Higher performance

Finally, mapWithState can provide 6X lower latency and maintain state for 10X more keys than updateStateByKey. This increase in performance and scalability is demonstrated by the following benchmark results. All results were generated with 1-second batches and the same cluster size. The following graph compares the average time taken to process each 1-second batch when using mapWithState and updateStateByKey. In each case, we maintained state for the same number of keys (from 0.25 to 1 million) and updated them at the same rate (30K updates/sec). As shown below, mapWithState can provide up to 8X lower processing times than updateStateByKey, therefore allowing lower end-to-end latencies.

[Figure 2: Up to 8X lower batch processing times (i.e., latency) with mapWithState than with updateStateByKey]

Furthermore, faster processing allows mapWithState to manage 10X more keys than updateStateByKey (with the same batch interval, cluster size, and update rate in both cases).

[Figure 3: Up to 10X more keys in state with mapWithState than with updateStateByKey]

This dramatic improvement is achieved by avoiding unnecessary processing of keys that have received no new data. Limiting the computation to keys with new data reduces the processing time of each batch, allowing lower latencies and a larger number of keys to be maintained.
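A toy cost model (hypothetical helpers, not Spark internals) makes the difference concrete: with incremental processing, the per-batch work is proportional to the number of keys with new data, whereas a full scan re-evaluates every key in the state.

```scala
// Toy cost model (hypothetical helpers, not Spark internals).
// Incremental processing touches only keys with new data:
def incrementalTouched[K](state: Map[K, Int], batch: Map[K, Int]): Int =
  batch.size

// A full scan re-evaluates every key in the state plus any new ones:
def fullScanTouched[K](state: Map[K, Int], batch: Map[K, Int]): Int =
  (state.keySet ++ batch.keySet).size

// Either way, the resulting state is the same:
def applyBatch[K](state: Map[K, Int], batch: Map[K, Int]): Map[K, Int] =
  state ++ batch.map { case (k, v) => k -> (state.getOrElse(k, 0) + v) }
```

With a million keys in state and a few thousand updated per batch, the per-batch work differs by orders of magnitude, which is the effect the benchmarks above measure.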

Other Improvements in Spark Streaming

Besides mapWithState, there are a number of additional improvements in Spark Streaming in Spark 1.6. Some of them are as follows:

  • Streaming UI improvements [SPARK-10885, SPARK-11742]: Job failures and other details have been exposed in the streaming UI for easier debugging.
  • API improvements in Kinesis integration [SPARK-11198, SPARK-10891]: Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent de-aggregation of KPL-aggregated records. In addition, an arbitrary function can now be applied to each Kinesis record in the receiver to customize what data is stored in memory.
  • Python Streaming Listener API [SPARK-6328]: Get streaming statistics (scheduling delays, batch processing times, etc.) in Python.
  • Support for S3 for writing Write Ahead Logs (WALs) [SPARK-11324, SPARK-11141]: Write Ahead Logs are used by Spark Streaming to ensure fault-tolerance of received data. Spark 1.6 allows WALs to be used on S3 and other file systems that do not support file flushes. See the programming guide for more details.

If you want to try out these new features, you can already use Spark 1.6 in Databricks, alongside older versions of Spark. Sign up for a free trial account here.
