At its heart, Spark Streaming is a scheduling framework that efficiently collects data and delivers it to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also give us access to deeper levels of the API, where we can operate directly on RDDs, transform them into Datasets to make use of that abstraction, or store the data for later processing. Between these API layers lie many hooks that we can use to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, show practical uses of the forgotten ‘ConstantInputDStream’, and explain how to combine Spark Streaming with probabilistic data structures to optimize memory use and improve the resource usage of long-running streaming jobs. Attendees of this session will come away with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
Session hashtag: #EUstr2
Gerard is the lead of the Data Processing Team at Virdata.com, where he and his team work on building and extending the data processing pipeline for Virdata's IoT cloud platform. He has a background in Computer Science and is a former Java geek now converted to Scala. Throughout his career at technology companies like Alcatel-Lucent, Bell Labs and Sony, he has been mostly involved in the interaction of back-end services and devices, which has now converged in his IoT-focused work at Virdata.