In Spark 2.0, we have extended DataFrames and Datasets to handle streaming data. Streaming Datasets not only provide a single programming abstraction for batch and streaming data, they also bring support for event-time-based processing, out-of-order and delayed data, sessionization, and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “continuous applications”.
Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and he currently works at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research on data-center frameworks and networks with Scott Shenker and Ion Stoica.