Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which makes all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.
In particular, I’m going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
Session hashtag: #DD1SAIS
Tathagata Das is an Apache Spark committer and a member of the PMC. He's the lead developer behind Spark Streaming and currently develops Structured Streaming. Previously, he was a grad student in the UC Berkeley at AMPLab, where he conducted research about data-center frameworks and networks with Scott Shenker and Ion Stoica.