The SQL tab in the Spark UI provides a lot of information for analysing your Spark queries, ranging from the query plan to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented and have trouble using it to their benefit. In this talk we want to give a gentle introduction to reading this SQL tab. We will first go over the common Spark operations, such as scans, projects, filters, aggregations and joins, and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.
After attending this session you will have a better grasp of query plans and the SQL tab, and will be able to use this knowledge to increase the performance of your Spark queries.
Speakers: Stefan van Wouw and Max Thone
– Hello everyone, and welcome to this Data + AI Summit presentation, in which we will talk about supercharging your Spark queries using the Spark UI SQL Tab. My name is Max Thone. I’m a Resident Solutions Architect at Databricks, and my co-presenter is Stefan van Wouw, who is a Senior Resident Solutions Architect, also at Databricks. So what are we going to talk about today? In the first part of this presentation we are going to give an introduction to the Spark SQL Tab, which is one of the tabs in the Spark UI that most of you are probably familiar with. Next, Stefan will introduce you to the most common components that you will find in the Query Plan, which is the Spark execution plan shown within the Spark SQL Tab. And finally, I will walk you through a couple of scenarios, some real-world Spark scenarios, in which we can use the knowledge we gained about the Query Plan and the Spark SQL Tab to optimize your Spark queries, to make them as fast as possible. So in this first part, I’ll just give you a brief introduction to the Spark SQL Tab. First of all, why should you really know about the SQL Tab? Well, it really gives us two important pieces of insight. First, it shows how exactly a Spark query is executed. As most of you probably know, when you submit a DataFrame or Spark SQL query to Spark, it first goes through the optimization engine called Catalyst before Spark generates the final Query Plan that gets handed to the JVM. The SQL Tab gives us insight into this final Query Plan before it goes into the JVM, so we can use the SQL Tab to reason about whether this final physical plan is actually the most optimal plan that could be executed. Second, we can also use the Query Plan in the SQL Tab to reason about the query execution time.
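To give a feel for what "a series of optimization steps" means, here is a toy emulation in plain Python (this is not Catalyst code; the plan representation and predicates are made up for the demo). It mimics a simplified version of Catalyst's CombineFilters rule, which collapses two stacked Filter nodes into one:

```python
# Toy emulation (not Catalyst code) of an optimizer rule: rewrite a logical
# plan tree into a cheaper, equivalent one. Plan nodes are nested tuples of
# the form (operator, argument, child); all names here are made up.
def combine_filters(plan):
    # A Filter whose child is also a Filter becomes one Filter with both
    # predicates combined -- fewer operators to execute for the same result.
    if plan[0] == "Filter" and plan[2][0] == "Filter":
        _, p1, (_, p2, child) = plan
        return combine_filters(("Filter", f"({p1}) AND ({p2})", child))
    return plan

plan = ("Filter", "sales > 0",
        ("Filter", "item_id IN (1, 2)",
         ("Scan", "sales_table", None)))
optimized = combine_filters(plan)
print(optimized)  # one combined Filter on top of the Scan
```

The real Catalyst optimizer applies many such rules repeatedly (rule-based and cost-based) until the plan stabilizes, and the SQL Tab shows you the end result of that process.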
And again, critically think about whether it is the lowest execution time that we could possibly achieve for this query. So how do we actually get to the SQL Tab? On the lower part of the slide you will see a screenshot of the Spark UI. You simply first go to the Spark UI within your cluster, and then you can click on the SQL Tab, as you can see in the little yellow rectangle. When you click on the SQL Tab, you’ll see a list of all the SQL queries that have been executed on your cluster. Each description corresponds to one SQL query, and when you click on one of the descriptions, you get a visual overview of the Query Plan, which I will show you later in this presentation. So what exactly is a Query Plan? As I already mentioned, a Spark SQL, DataFrame or Dataset query will go through the optimization engine called the Spark Catalyst Optimizer before it is finally executed by the JVM. The Catalyst Optimizer executes a series of optimization steps before arriving at the final execution plan, which is called the Selected Physical Plan that you can see on the right. So what we typically mean by Query Plan is this final Selected Physical Plan, the output of the Spark Catalyst Optimizer. Where does this Query Plan sit in the hierarchy? I think most of you are already familiar with the famous hierarchy that goes from Spark job to stage to task. Going from right to left: one Spark stage executes a number of tasks, and in your cluster each executor core can execute one task at a time; one Spark job, in turn, executes multiple stages. Maybe you already noticed, but one Spark job is usually cut up by what we call shuffle boundaries, and these shuffle boundaries dictate how many stages one Spark job has.
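To make the job/stage/task relationship concrete, here is a back-of-envelope model in plain Python. The partition counts and core counts below are made up for illustration, and real Spark scheduling is more involved; the point is simply that each stage runs one task per partition, and a stage with more tasks than total executor cores needs several "waves" of tasks:

```python
# Back-of-envelope model of the Job -> Stage -> Task hierarchy. The stage
# names and task counts are hypothetical (loosely echoing the example on
# the slide, where two stages run nine tasks in total).
import math

def stage_waves(num_tasks: int, total_cores: int) -> int:
    """Each executor core runs one task at a time, so a stage with more
    tasks than total cores completes in several sequential waves."""
    return math.ceil(num_tasks / total_cores)

stages = {"stage_50": 8, "stage_51": 1}   # tasks per stage (made up)
total_cores = 4                           # e.g. 2 executors x 2 cores

total_tasks = sum(stages.values())
waves = {s: stage_waves(t, total_cores) for s, t in stages.items()}
print(total_tasks)  # 9
print(waves)        # {'stage_50': 2, 'stage_51': 1}
```

This is why a stage with many small partitions, or a handful of very large ones, can dominate the runtime of the whole job.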
The Query Plan actually sits on top of these Spark jobs: one Spark query can generate multiple Spark jobs. And on top of that again is the DataFrame action, something like DataFrame.count or DataFrame.write.parquet, which can in turn generate one or more queries. In this slide I just want to show you a simple example of how this hierarchy plays out. On the top left we have a very simple Spark query, in which we have a DataFrame that contains some sales data. What we want to do here is filter on certain item_id codes, and for each of these item_id codes get the total number of sales. So this is a filter, after which we do a group by, and then we aggregate the sales per item_id. We then trigger the query by executing a write command on this DataFrame. So this is the actual DataFrame action, which is the first step. This query will then generate, in this case, one SQL query, which you can see on the top right next to number two. That is the SQL query associated with this DataFrame query, and in the following slides we will click on this query and see what kind of information we get. This query, in turn, generates one Spark job; that is job 24, which you can see on the top right, and this job has generated two stages, which you can see on the bottom right: stage 50 and stage 51. Finally, you can see on the bottom left, on the summary tab, that these two stages in total generate nine tasks. What I just want you to pay some attention to is that one way to get to the SQL query is to simply go to the job that’s associated with this DataFrame query, and then click on the blue number next to its associated SQL query on the top right. So what does the SQL query look like?
So we see a visual overview of it on the right, and we see the actual generated physical plan, that is, the Query Plan, at the bottom. So we can do a simple mapping exercise using this simple query. At the very top of the SQL plan we see what is called an InMemoryTableScan. This InMemoryTableScan reflects the fact that we read this DataFrame from a cached DataFrame; it is reading the DataFrame from memory, and that is what InMemoryTableScan means. Next we do a filter operation, which you can see by following the yellow arrow: a filter in the DataFrame query simply leads to a Filter operation in the final physical plan. Then we see a bunch of operations that have to do with the aggregation: it actually performs two HashAggregates, with a shuffle Exchange in between, and Stefan will explain a bit more about why it is doing two HashAggregates. For now it’s enough to know that this group by plus aggregation leads to these aggregation steps in the final physical plan. The final step of this physical plan is then the actual write step, which is denoted by the OverwriteByExpression that you can see on the bottom right. So this is really a very simple query, and you can already intuitively see how all the steps in the DataFrame query map to the physical plan operators that you see on the right and in the physical plan at the bottom. But what we now really want to cover in this presentation is: what possible operators exist in this physical plan? We have now seen a Filter and a HashAggregate, but what do these actually mean? And what other operators like these do we have?
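The shape of that plan can be emulated in plain Python (this is not Spark code; the cached rows, column names and partition layout below are all made up, loosely following the item_id/sales example from the slide). The first HashAggregate computes partial sums within each input partition, the Exchange shuffles those partial results so each key lands on one reducer, and the second HashAggregate merges them into the final totals:

```python
# Plain-Python emulation (not Spark code) of the physical plan just shown:
# InMemoryTableScan -> Filter -> HashAggregate -> Exchange -> HashAggregate.
from collections import defaultdict

cached_partitions = [                        # InMemoryTableScan: cached input
    [{"item_id": 1, "sales": 10}, {"item_id": 2, "sales": 5}],
    [{"item_id": 1, "sales": 7},  {"item_id": 3, "sales": 2}],
]
wanted_ids = {1, 2}                          # the Filter predicate

# Filter + first HashAggregate: partial sums within each input partition,
# so far less data has to cross the network in the shuffle.
partials = []
for part in cached_partitions:
    acc = defaultdict(int)
    for row in part:
        if row["item_id"] in wanted_ids:         # Filter
            acc[row["item_id"]] += row["sales"]  # partial HashAggregate
    partials.append(acc)

# Exchange: route each partial sum to a reducer chosen by hash(key),
# so all partials for one key end up on the same reducer.
num_reducers = 2
reducers = [defaultdict(int) for _ in range(num_reducers)]
for acc in partials:
    for key, subtotal in acc.items():
        reducers[hash(key) % num_reducers][key] += subtotal  # final HashAggregate

final = {k: v for r in reducers for k, v in r.items()}
print(sorted(final.items()))  # [(1, 17), (2, 5)]
```

Partial aggregation before the shuffle is exactly why Spark emits two HashAggregates: only one small partial row per key per partition crosses the shuffle boundary, instead of every input row.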
And next, we of course also have the details section within this plan, which you can see on the right. The details convey all the Spark metrics that accompany each component of this physical plan, and Stefan and I will also go a bit more into detail on how to use these details to further optimize a query. Finally, once we have given you this brief overview of all the operators and all the metrics that you can analyze using the SQL Tab, I will go over a couple of scenarios that use this knowledge to optimize some example SQL queries. With that slide I have finished the introduction to this presentation, and Stefan will now take you through an overview of all the common components of the physical plan.
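As a small preview of how such per-operator metrics can be used: the task-level shuffle sizes you can read from the UI often reveal skew when one task processes far more data than the rest. The helper and threshold below are hypothetical (not part of Spark or its UI), just a sketch of the kind of check you would do mentally when reading the details section:

```python
# Hypothetical skew check (not a Spark API): given per-task shuffle read
# sizes, flag a stage whose largest task reads far more than the median.
# The 5x threshold and the byte counts are made-up illustration values.
import statistics

def looks_skewed(shuffle_read_bytes, ratio=5.0):
    """True if the largest task reads at least `ratio` times the median."""
    median = statistics.median(shuffle_read_bytes)
    return median > 0 and max(shuffle_read_bytes) / median >= ratio

balanced = [100, 120, 95, 110]   # bytes per task, roughly even
skewed = [100, 120, 95, 5_000]   # one straggler task doing most of the work

print(looks_skewed(balanced))  # False
print(looks_skewed(skewed))    # True
```

A stage that trips a check like this is usually dominated by one hot key; the scenarios later in the talk look at exactly this kind of bottleneck.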
Stefan is a performance and scalability subject matter expert at Databricks. He has a background in parallel distributed systems and has years of experience in the Big Data analytics field. More recently, he has been focusing on deploying Structured Streaming applications at scale, advising clients on how they can build out their pipelines from proof of concept to production-grade systems.
Max is a machine learning and performance expert at Databricks. He has a background in machine learning and analytics, and has worked in analytics teams in a variety of industries. At Databricks he has been focusing on advising clients with implementing and tuning their Spark production jobs, ranging from solving skew problems to deploying streaming jobs at scale.