Stefan is a performance and scalability subject matter expert at Databricks. He has a background in parallel distributed systems and years of experience in big data analytics. More recently, he has focused on deploying Structured Streaming applications at scale, advising clients on how to take their pipelines from proofs of concept to production-grade systems.
Running a stream in a development environment is relatively easy. In production, however, a few topics cause serious issues when they are not addressed properly. This presentation covers four of them. The first is what happens when the input parameters of your stream are not properly configured: the stream may suddenly have to process far more data than anticipated, which causes considerable performance degradation.
The second topic is stateful streaming parameters and the consequences of not tuning them correctly: state can accumulate without bound, which is another source of degraded performance as well as memory issues. The third topic is Structured Streaming output parameters; left unaddressed, they can lead to a severe case of the small files problem. The final topic covers what to consider when you want to modify a streaming job that is already in production and has checkpoints involved. We will provide practical, hands-on examples of when these issues manifest and how to prevent them from occurring in your production streams. By the end of the talk you will know what to look out for when designing performant and fault-tolerant streams.