Query or Not to Query? Using Apache Spark Metrics to Highlight Potentially Problematic Queries

Download Slides

John submits a query and expects it to run smoothly. Based on his prior experience, he anticipates the query to finish in 20 mins.
Scenario-1: John’s query finishes execution in the expected timeframe and doesn’t impact any other concurrent query in the workload.
Scenario-2: John’s query takes twice the expected time, and also slows down multiple other concurrent queries. John now wonders “should I have submitted this query?”.

At Unravel, we have implemented a Dash-app, called qSteer, that proactively alerts Spark users, like John, about a possible slowdown that their query can cause or face if submitted to the cluster. Additionally, it auto-suggests the users a few future time slots that they can choose to submit the query alternatively. Query plan details fetched from SparkSQL, and execution performance metrics from Spark core have enabled us to generate query quality predictors. This data combined with historical query executions and current cluster state helps us derive a set of likely outcomes “if” the user’s query is submitted. qSteer uses ML techniques to flag a schedule “unacceptable” based on prior experiences, and alerts the user of possible delays from submitting a query.

During the talk, we will discuss the architecture, our core algorithm for performing predictive analytics, and demo our app to showcase how users can use it with ease to decide whether to submit their query or not at any time. We will share our experiences of using qSteer at scale at one of our customers – specifically, the challenges we faced, scenarios where it worked/not worked, and some learnings along the way.

Speakers: Prajakta Kalmegh and Yusaku Sako

Transcript

– Hello everyone. Thanks for joining today for an exciting session from Unravel Data Systems. I’m Prajakta Kalmegh and I’m working as a principal engineer at Unravel Data and I’ll be joined today by Yusaku Sako, who is basically heading some exciting projects in data science team. So, Unravel has a very strong tech background and a rich experience owing to the broad coverage it has in various technologies and cloud platforms. We are very proud to have served and entrusted by our customers and our partners who value Unravel today as one of the most reliable and trusted vendor. So, what is DataOps? So, DataOps practices basically automate and provide continuous quality checks to bring speed and agility throughout your data analytics process. So, we at Unravel basically understand the complexity and the challenges you face for every stage of the DataOps cycle. Today we provide solutions for observability and real time objectionability. We go a step beyond from just providing passive monitoring, using AI and ML to automate most of your DataOps tasks. For example, we help you plan the infrastructure, the environment, and select the right tech and the topology for your cloud. We understand the development hiccups and provide solutions for the users to fine-tune their apps early on. Our solutions offer deeper capability of data-oriented pipelines, which helps the user test the correctness and performance efficiency of their pipelines, even before they go to production. We understand the challenges the user faces when trying to schedule a new workload, selecting the container sizes, and our softwares enable users to handle this process with ease. Our customers have used Unravel successfully to not only tune their apps, but also eliminate rogue behavior, or even to move some apps to a new environment at runtime. Finally, we create solutions to continually monitor your apps as they are running in production to find issues, to detect issues, provide fixes, optimized, to accelerate, and even to accommodate additional workloads. So, today in this talk, we are, we’ll be showcasing some very cool and impactful use cases that you can use throughout the code, the test, the deploy and operate phases of the DataOps cycle. Specifically, we will show how using the Unravel API, users can customize their dashboards for every person in the company, right from data engineers, to executives. For example, using the Unravel API, developers, the data engineers, the data analysts, the data scientists, they can use to submit ad hoc apps with ease by detecting issues early on. We will also see a demo of how a scheduling assistant tool can be used by the platform team during their deploy phases to schedule new workloads. And finally, we will see how Unravel uses predictive analytics to provide operational actionable insights. So, let’s see how we can speed up the development. So, we want to see how the day of our developer John is going today. So, John very happily and hopefully submits a very simple query that, aggregate sales by age and he is expecting the query to run fast because he has done this several times before. But today for some reason, the query is taking long and way too long. Once the query completes execution, John basically tries to debug. Hey, what was the issue? Why did the query takes so long? And he detects that one of my tasks suffers from data skew, but John knows a fix. So, he submits the query again, but this time, apparently it didn’t work, as well. So, then he remembers that, hey in the past, I had used a trick, like provided a hint for broadcast hash join, and that made it work, maybe let me try that. But even after a lot of time, at the end of the day, John is still wondering why didn’t it work? Well was the cluster an issue? Maybe my small table was not small enough or maybe the timing was just not right? So, let us take a look at why this process is hard, right? So, developers today suffer from multiple issues during the development cycle. They face challenges and they try to figure out what went wrong. Sometimes they can even find out the root cause, and like, for example, my data was skewed on a particular column or they tried to pull out some workarounds from the past and play around with the query, but when things just don’t seem to work, you start looking for clues. Maybe the query was not resource efficient or maybe the cluster was just bottle necked and the user doesn’t have a clue of what went wrong. There will be task failures, contentions would have led to some slow tasks. So, the key to your DataOps success is how efficiently you adopt and implement the interactions between your people, the process, and the technology. So, John, users like John today, need tools like Unravel to bring out the data-driven insights and generate some key performance predictions, even before they submit the query, as they code, right? So, let us take a quick demo of how Unravel integrates the Unravel API with the Jupyter notebook to help users, like John, to figure out whether to query or not to query.

– Hi, I’m Yusaku Sako. I’m the head of data science at Unravel data and today I’m super excited to reveal to the public for the very first time, how Unravel can bring you AI-driven actionable insights and recommendations for your big data apps as you’re coding in your favorite IDEs, such as Jupyter notebook. As you will see, this is very different from what happens today where you need to jump around between different UIs and logs after your job executed to try to connect the dots and still not know what to do. So, let’s see this in action. So, to connect this Jupyter notebook to Unravel AI engine API, all you need to do is install the Unravel AI package and import it as shown on the screen and basically you just need to make a single call to connect with the Unravel API endpoint and a valid API key. So, let’s do that. Now we’re connected and now we’re ready to analyze Spark apps through this notebook and to perform analysis, you use Jupyter magic with double , this is called magic and for Unravel Spark analysis, we have a custom magic like this, and what you need to do is supply an optional ID, which is the variable name, where does he fall query string will be written to for your convenience and also days, for number of days for analysis, as you will. This comes, this will become clear to you shortly. So, let’s say that you have this, SQL query where you have a join between two tables. So, let’s ask Unravel AI and see what’s happening with this query. So, I just submitted a request and this is the API response that came back, and this is saying that in the past 30 days in this cluster, there were 17 similar runs of this query with the following statistics: the fastest one run 34 minutes, the longest took 2.8 hours, and the median duration was 55 minutes, with a 100% success rate. So, for this query, one issue was detected and it was query slowness due to data skew. So, the Unravel API analyzed and found that the join between these two tables, the back table and the dimension table, is problematic because there’s data skew on the fact table, and so he came back with the recommended action to increase this parameter auto broadcast join threshold to 232 megabytes from the default of 10 megabytes and it also explains that enabling broadcast join would counteract data skew and it can speed up the query substantially. And how did Unravel come up with this recommendation? So, let’s take a look. So, whenever there’s an Explain link, you can click on it and it will reveal more information. So, for this particular query, it found 17 similar query runs in the past, over the last 30 days. And so the key here is that between the two tables that were joined, the smaller table, which in this case is global dimension, the data size was 202 megabytes maximum over the last 17 runs and you can see it’s hovering around the 200 megabytes with 202 max. So the recommendation to increase this value is based on this data and plus some headroom to make it 232 megabytes, and so whenever there’s a data skew problem and the smaller table is reasonably sized, as in this case, then broadcast join can help speed up your query, and that’s what the Unravel recommends. And you didn’t need to know anything about this auto broadcast join threshold. You didn’t even need to look at historical runs of a query. Unravel just found it for you and made this recommendation. So, let’s take a look at another example. So let’s submit a request to the Unravel AI API, and it came back with this following result. So, this query is a little bit different from the previous one, in that, so we found 12 similar runs. In general, this query takes a little bit longer to run, but, as you saw in the previous query, the issue found is the same. So, this is query slowness due to data skew, but interestingly enough, the recommended action is actually different this time. So, the recommended action is to enable adaptive query execution and also to set skew join, adaptive skew join enabled. And why did it give a different recommendation this time for the same issue? So, let’s take a look. What Unravel did was to take, it took these 12 similar runs, and again, did a similar analysis where it looked at the data size of the smaller table in the join and, in this case, this table was a little bit bigger, as you can see in the past 12 runs. So, in the previous example, the table size was 200 megabytes, and this time around, it’s around 1.6 plus gigabytes, and this is much larger, and whenever you have the smaller table with data size, this big, more than a gig and your query has data skew, then adaptive query execution with skew resistance, it’s recommended and it should perform well in such scenarios. There’s a caveat. So, this adaptive query execution is a new feature in Spark. So, if you don’t have Spark 3, then you can’t use these settings, enable this, because it’s not there. So, in such scenarios, Unravel would automatically detect that condition and say, okay, adaptive query execution is not available. Then we’ll recommend a different technique such as sorting to count counteract data skew issues. So, again, you didn’t have to know anything about adaptive query execution or anything about broadcast join or the join and the smaller table data size. It just, Unravel did it for you automatically and pretty much instantaneously. Let’s take a look at another example. So, this query is a bit different from the previous two and runs relatively quickly, but there’s still some issues or definitely room for improvement. So, a couple of issues are detected. One is low utilization of CPU resources and another is low utilization of memory resources and specifically, Unravel was recommending to increase Spark executor cores to two from previous setting of one and also to decrease Spark executor memory to about 2.6 gigs from the previous setting that was used for the past runs, 4.5 gigs, and this explains that applying these recommendations will speed up your application while reducing allocated memory. Okay, so let’s take a look and see how these would be right. In a similar manner, Unravel looked at, found seven similar runs of this query, and actually looked at the execute or memory usage and not just the container allocation maps. So, the line that you see in green, this is actually a VM RSS from the executor. So, in this scenario, there was only one executor and the blue line shows the Spark executor memory setting that was used and the orange dotted line shows the recommended new value for Spark’s executor memory and this is actually derived in such a manner that these two setting changes work well in conjunction, meaning increasing the core’s size, but it’s still calculating the memory to go well with it and to manually come up with such combinations that work well against a multitude of factors involved, is a difficult task and not an easy one. So, as you saw in these three examples, where radically changing the way that app development and troubleshooting can happen by bringing actionable insights directly into your developer environment.

– So, we just saw an exciting demo, and I just wanted to talk about how does Unravel basically do this, right? So, Unravel basically exploits the fact that users like John, they issue similar queries and multiple users who are sharing the cluster, thus faced similar challenges, right? And they often also repeat their mistakes again and again. So, by exploiting the data from the same user and other users, Unravel basically gets a holistic view of what worked in the past when your query was executed and what did not. It uses this information to generate the key performance predictions as you code. So, now let us take a look at the second use case of how Unravel helps users in the deploy stage to schedule new workloads. So, it is very crucial to provide contextual insights to be truly data-driven today, right? So, for example, users who want to schedule a new workload, they would say like, hey, I have my query several times and I have tested it on the UAT cluster multiple times. So, I know my source requirements. So, can I just check what is a good time slot where I can schedule the query, right? Or even very pressing use case is like, users who want to generate a report by a particular time. Say, I want my report by 7 am on Monday morning, and you can schedule it whenever you want, but I just want my report to be ready by 7 am, right? So, it is important to contextualize whether it is a resource constraint or a time constraint, or neither of them, right? A user might come and say that I have a new workload that I need to schedule on this cluster. What is the best time? When should I start the pipeline to, for it to run efficiently? So, how do we find the missing piece of this puzzle? Do we look at the cluster issues? Maybe the tasks, failures, or the contentions, or the memory issues, or do we look at the query problems? Like, has the input data size increased over time? Is front heavy or is it both like, a user trying to schedule a shuffle-heavy query at a time when the cluster is known to have more network load. So Unravel contextualizes these insights by generating the predictive insights. Our algorithm basically works harder till it finds the right slot, or as I like to call it, till it times it right. So, let’s take a look at a quick demo of how Unravel integrates with the API to provide scheduling assistance. So, if you see the screen, basically users can come to Unravel, select the cluster on which they want to say schedule, select the window for the analysis and users will be able to see what are the available vCores or available memory in the particular time slot. So, what this heat map shows is the seven days of the week and the 24 hours and at every slot, what was my available vCores. So, the user who is confident about say the resource requirements, say, I want a 10 vCores and I want, say 16 GB memory, right? And my app is expected to run for 30 minutes. Can you recommend me some good slot to schedule it? So Unravel comes with the recommendation and sees that, hey, maybe it is good that you can schedule the query at 4 pm, right? You have the required vCores and memory and the users can then click on a particular slot and also see, like, maybe I want to take a look at what are the other pipelines or apps that typically run in the same time slot that, to get me an idea of, who will I be running concurrently with. Maybe the user doesn’t want to submit the query on Monday, right? He prefers a different slot. So he can click on the all feasible tab here and then Unravel generates all the feasible slots in the heat map, basically where the user can schedule and maybe the user can decide like, hey, how about like I submit on say Thursday because I have two more available slots after that and even if my query is spilling over or not finishing in time, I still have ample room for it to execute properly. So, Unravel basically utilizes the cluster metric time series, the cluster statistics, right and the query execution variability data, the resource profiles, the consumption history. Do you know when exactly it worked like, when exactly your query was able to perform well, in what circumstances and when it did not? Using this data, Unravel generates its predictive analytics. So, apart from the two very cool use cases that we have seen, Unravel also generates operational and actionable insights. So, we don’t have time to do the demo for this today, but I just wanted to real quick show this screenshot where what we see here is a pipeline with multiple components. We see a timeline of each of the component, the input size for each of the component of the pipeline, and what you can see prominently here is like, hey, the pipeline was delayed and it was delayed by about 26% above the average execution time. So, when the user clicks on the insights that Unravel generates for this pipeline, right, they can see that input read was actually 40% higher than the baseline model. And just right next to it, the users can also change the metric and see that my query had a consistent input data size till now, but in this particular anomalous instance, it peaked and went way above, like from 2.8 GB to around 4.8 GB. So, how does Unravel basically do all this, right? So, Unravel employs adaptive data collection strategy, the Unravel demons collect information about everything in the ecosystem, right? Like from your apps, the pipeline, the containers, the system, the data sets and so on, and it uses this data to build robust ML model, which generates query quality predictors and the cluster state indicators, which is then used by the Unravel API to kind of provide visualizations for every stage of your DataOps. So, in this talk today, what we have seen is basically three different use cases, which show us that you can very seamlessly integrate the Unravel API and make it accessible to every single person now in the DataOps cycle, your developers, your platform team, your executives, who can derive much more data-driven and accurate insights with Unravel. So, just to recap, what does Unravel do, right? It detects issues as early as possible and what does that mean for a developer like John? It just means that you come in happily, submit a query at 8 am, the query completes in a very short time, John ends up feeling very ecstatic and accomplished because he had a successful query execution early in the day and it was a predictable execution based on his expectations. And finally John finds the time and the opportunity to invest on the right things and to innovate and enjoy the rest of his day. So, that is all from Unravel Data Systems. Thanks for watching and we’ll be happy to take any questions.


 
Watch more Data + AI sessions here
or
Try Databricks for free
« back
About Prajakta Kalmegh

Unravel Data

Prajakta Kalmegh received her Ph.D. degree in Computer Science from Duke University in May 2019. Her dissertation work focused on analyzing resource contentions in data analytics cluster frameworks, and using these insights to make dynamic and contention-aware scheduling decisions for an online heterogeneous workload. Before joining Duke, she received her M.S. degree in Computer Science from Georgia Institute of Technology in May 2010. She has an extensive industry and research experience with esteemed companies like IBM Research Labs, Microsoft Corporation, SAP Labs, and Persistent Systems. She currently works as a Principal Engineer at Unravel Data.

About Yusaku Sako

Unravel Data

Yusaku Sako is Head of Data Science at Unravel Data, where he combines his passion for data science, machine learning, and big data to deliver high-value insights to Unravel's customers to help them get the most out of their big data investments. Prior to joining Unravel, he was Director of Engineering at Hortonworks where he led the development of Apache Ambari for which he served as the Project Chair (VP Apache Ambari).