Gerard is the lead of the Data Processing Team at Virdata.com, where he and his team build and extend the data processing pipeline for Virdata’s IoT cloud platform. He has a background in Computer Science and is a former Java geek now converted to Scala. Throughout his career at technology companies such as Alcatel-Lucent, Bell Labs and Sony, he has mostly worked on the interaction between back-end services and devices, which has now converged in his IoT-focused work at Virdata.
Spark Streaming lets users develop and continuously deliver fresh analytical answers, and it does so with minimal overhead compared to a batch job. But one hard part of streaming with Spark is tuning a cluster, especially in high-throughput situations. This talk will draw on the experience of deploying clusters that handle millions of updates per second to show how to do it better. After covering the internals of Spark Streaming, we will explain how to scale ingestion and how to tune parallelism, data locality, caching and logging. But will every step of this fine-tuning remain necessary forever? As we dive into recent work on Spark Streaming, we will show how clusters can self-adapt to high-throughput situations. The audience will take away a better grasp of Streaming internals and will know how to configure their cluster for long-running jobs. After a quick introduction to Reactive Streams, they will also understand how asynchronous back-pressure helps make Streaming more resilient.
At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can operate directly on RDDs, transform them into Datasets to make use of that abstraction, or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, show practical uses of the forgotten 'ConstantInputDStream', and explain how to combine Spark Streaming with probabilistic data structures to reduce the memory footprint of long-running streaming jobs. Attendees of this session will come away with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs. Session hashtag: #EUstr2