Spark Streaming lets users develop and continuously deliver fresh analytical answers, with far less overhead than a batch job. But one hard part of streaming with Spark is tuning the cluster, especially in high-throughput situations. This talk will draw on experience deploying clusters that handle millions of updates per second to show how to do it better. After reviewing the internals of Spark Streaming, we will explain how to scale ingestion, parallelism, data locality, caching and logging. But will every step of this fine-tuning remain necessary forever? As we dive into recent work on Spark Streaming, we will show how clusters can self-adapt to high-throughput situations. The audience will take away a better grasp of Streaming internals, and know how to configure their cluster for long-running jobs. After a quick introduction to Reactive Streams, they will also understand how asynchronous back pressure helps make Streaming more resilient.
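For readers who want to try the back-pressure mechanism the talk refers to: since Spark 1.5, a rate-based back-pressure controller can be switched on through configuration. A minimal sketch (the numeric limits below are illustrative assumptions, not recommendations):

```
# spark-defaults.conf
# Let Spark Streaming dynamically adapt ingestion rates to processing speed
spark.streaming.backpressure.enabled        true
# Upper bounds that apply before the rate estimator has warmed up
# (values here are placeholders; tune for your workload)
spark.streaming.receiver.maxRate            10000
spark.streaming.kafka.maxRatePerPartition   2000
```

With back pressure enabled, these maximum rates act as a ceiling while the internal rate estimator converges on a sustainable throughput.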
Gerard is the lead of the Data Processing Team at Virdata.com, where he and his team work on building and extending the data processing pipeline for Virdata's IoT cloud platform. He has a background in Computer Science and is a former Java geek now converted to Scala. Through his career at technology companies such as Alcatel-Lucent, Bell Labs and Sony, he has mostly been involved in the interaction between back-end services and devices, which has now converged in his IoT-focused work at Virdata.
François Garillot joined Swisscom in 2015, and has since worked on curating and understanding telecommunications data through big data tools. Previously, he worked on Apache Spark Streaming's reliability at Lightbend (formerly Typesafe). His interests span machine learning (especially online models), approximation and hashing techniques, control theory, and unsupervised time series analysis. In his free time he enjoys skiing, sailing and hunting for good cheese.