At its heart, Spark Streaming is a scheduling framework that efficiently collects data and delivers it to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can operate directly on RDDs, transform them into Datasets to make use of that abstraction, or store the data for later processing. Between these API layers lie many hooks that we can use to enrich our Spark Streaming jobs. In this presentation, we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, show practical uses of the forgotten ‘ConstantInputDStream’, and explain how to combine Spark Streaming with probabilistic data structures to optimize memory use in long-running streaming jobs. Attendees will leave this session with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
Session hashtag: #EUstr2
Gerard Maas contributes to Lightbend’s Fast Data Platform as a software engineer, where he focuses on the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He enjoys giving tech talks, contributing to small and large open source projects, tinkering with drones, and building personal IoT projects. He is a co-author of Learning Spark Streaming, from O’Reilly Media.