The SQL tab in the Spark UI provides a lot of information for analysing your spark queries, ranging from the query plan, to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented, and have trouble using it to their benefit. In this talk we want to give a gentle introduction to how to read this SQL tab. We will first go over all the common spark operations, such as scans, projects, filter, aggregations and joins; and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.
After attending this session you will have a better grasp on query plans and the SQL tab, and will be able to use this knowledge to increase the performance of your spark queries.
Speakers: Stefan van Wouw and Max Thone
Stefan is a performance and scalability subject matter expert at Databricks. He has a background in parallel distributed systems and has years of experience in the Big Data Analytics field. More recently, he is focusing on deploying Structured Streaming applications at scale, advising clients on how they can build out their pipelines from proof of concepts to production grade systems.
Max is a machine learning and performance expert at Databricks. He has a background in machine learning and analytics, and has worked in analytics teams in a variety of industries. At Databricks he has been focusing on advising clients with implementing and tuning their Spark production jobs, ranging from solving skew problems to deploying streaming jobs at scale.