Making Structured Streaming Ready for Production – Updates and Future Directions

Download Slides

In mid-2016, we introduced Structured Steaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing application without having to reason about having to reason about streaming. It allows the user to express their streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since we have put in a lot of work to make it ready for production use. In this talk, I am going to talk in more detail about the major features we have added, the recipes for using them in production, and the exciting new features we have plans for in future releases. Some of these features are as follows.

– Design and use of the Kafka Source
– Support for watermarks and event-time processing
– Support for more operations and output modes

About Tathagata Das

Tathagata Das is an Apache Spark committer and a member of the PMC. He's the lead developer behind Spark Streaming, which he started while a PhD student in the UC Berkeley AMPLab, and is currently employed at Databricks. Prior to Databricks, Tathagata worked at the AMPLab, conducting research about data-center frameworks and networks with Scott Shenker and Ion Stoica.