Max is a machine learning and performance expert at Databricks. He has a background in machine learning and analytics, and has worked in analytics teams in a variety of industries. At Databricks he has been focusing on advising clients with implementing and tuning their Spark production jobs, ranging from solving skew problems to deploying streaming jobs at scale.
Running a stream in a development environment is relatively easy. However, some topics can cause serious issues in production when they are not addressed properly. In this presentation we want to cover 4 topics that, when not addressed, can lead to serious issues for streams in production. The first topic considers what happens if input parameters of your stream are not properly configured. This can result in your stream having to suddenly process much more data than anticipated, causing considerable performance degradation.
The second topic will be about stateful streaming parameters and the consequences of not tuning these parameters correctly. This can lead to infinite state accumulation, and can be another source of degraded performance, as well as memory issues. In the third topic we discuss Structure Streaming output parameters. When not addressed, this can lead to a severe case of the small files problem. In the final topic, we will cover what to think about when you want to modify your streaming job while it is already in production and checkpoints are involved. We will provide practical hands-on examples on when aforementioned issues manifest and how to prevent them from occurring in your production streams. By the end of the talk you will know what to look out for when designing performant and fault-tolerant streams.