Stefan van Wouw

Sr. Resident Solutions Architect, Databricks

Stefan is a performance and scalability subject matter expert at Databricks. He has a background in parallel distributed systems and has years of experience in the Big Data Analytics field. More recently, he is focusing on deploying Structured Streaming applications at scale, advising clients on how they can build out their pipelines from proof of concepts to production grade systems.

Past sessions

The SQL tab in the Spark UI provides a lot of information for analysing your spark queries, ranging from the query plan, to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented, and have trouble using it to their benefit. In this talk we want to give a gentle introduction to how to read this SQL tab. We will first go over all the common spark operations, such as scans, projects, filter, aggregations and joins; and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.

After attending this session you will have a better grasp on query plans and the SQL tab, and will be able to use this knowledge to increase the performance of your spark queries.

Speakers: Stefan van Wouw and Max Thone

Running a stream in a development environment is relatively easy. However, some topics can cause serious issues in production when they are not addressed properly. In this presentation we want to cover 4 topics that, when not addressed, can lead to serious issues for streams in production. The first topic considers what happens if input parameters of your stream are not properly configured. This can result in your stream having to suddenly process much more data than anticipated, causing considerable performance degradation.

The second topic will be about stateful streaming parameters and the consequences of not tuning these parameters correctly. This can lead to infinite state accumulation, and can be another source of degraded performance, as well as memory issues. In the third topic we discuss Structure Streaming output parameters. When not addressed, this can lead to a severe case of the small files problem. In the final topic, we will cover what to think about when you want to modify your streaming job while it is already in production and checkpoints are involved. We will provide practical hands-on examples on when aforementioned issues manifest and how to prevent them from occurring in your production streams. By the end of the talk you will know what to look out for when designing performant and fault-tolerant streams.