Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads


Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multi-user objectives, like fairness and makespan. Unfortunately, existing schedulers largely do not consider performance heterogeneity.

In this talk, we present Gavel, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Gavel expresses these policies as optimization problems, making it easy to optimize for objectives in a heterogeneity-aware way, while also being cognizant of performance optimizations like space sharing. Gavel then uses a round-based scheduling mechanism to ensure jobs receive their ideal allocation given the target scheduling policy.

Gavel’s heterogeneity-aware policies allow a heterogeneous cluster to sustain higher input load, and improve end objectives such as average job completion time and makespan by up to 3.5× compared to heterogeneity-agnostic policies. This work will appear in OSDI 2020.

Speaker: Keshav Santhanam

Transcript

– Hi everyone, my name is Keshav and I’m a second-year PhD student at Stanford, and today I’ll be talking about heterogeneity-aware cluster scheduling policies for deep learning workloads. This is joint work with my colleagues Deepak and Fiodar at Stanford, Amar at Microsoft Research, and my advisor Matei.

So ML training is becoming a very important and also computationally intensive datacenter workload, and vendors are starting to deploy accelerator devices like NVIDIA GPUs, Google TPUs, FPGAs in Azure, and other domain-specific ASICs in order to keep up with the growing pace of ML training. But there’s a question of how we should effectively allocate these heterogeneous resources in order to maximize their performance. What organizations typically do today is have users submit their jobs, with some specific resource request, to a central cluster scheduler, which takes as input some cluster-wide objective, such as fairness, in order to determine which jobs should be allocated to which resources. The issue with this model is that some users will try to select one particular resource type, which subjects those jobs to very long queuing delays on the popular resource types, while the less popular resource types sit severely underutilized. So the question we’re trying to answer in this paper is how we should allocate heterogeneous resources to DL training jobs from multiple users while optimizing for different cluster-wide objectives.

There are a number of challenges associated with this problem. The first is heterogeneous performance behavior. We observed that there are many different types of deep neural networks, and each one of these models has different operators, for example convolutions and attention blocks, which perform differently across different hardware architectures. For example, if you look at this Transformer model here, we see that it gets a speedup of around 3x on a P100 and a V100 GPU compared to a K80. This is expected, as the V100 and P100 GPUs are newer than the K80. What’s perhaps more surprising is that the magnitude of the speedup across GPU generations varies significantly. If we look at the ResNet-50 model over here, we’ll see that it gets a speedup of almost 10x, compared to the Transformer, which is only getting the 3x speedup. What this means is that if we don’t take this performance heterogeneity into account, we can end up with unfair allocations. And we observe similar behavior for other model types such as A3C, CycleGAN, and ResNet-18.

The next big challenge in scheduling these heterogeneous resources is that users have many different scheduling objectives they’re interested in. For example, these can include single-job objectives, such as maximizing throughput or minimizing cost, and even combinations of these objectives, such as minimizing cost subject to SLOs, which entails starting a job on a slower, cheaper instance and then moving it to a faster, more expensive instance as the deadline approaches. There are also multi-job objectives such as fairness, or even hierarchical policies, like the one we see here, where an organization has a weighted fairness policy at the top and each individual sub-organization has its own policy. So in summary, the challenges associated with allocating these heterogeneous resources are two-fold.
Firstly, there are many different model types as well as hardware architectures, and this results in a lot of performance heterogeneity. The other challenge is that users have many different scheduling objectives they’re interested in, and we want to support all of them.

To address these challenges, we propose Gavel, a new heterogeneity-aware cluster scheduler. Gavel generalizes a wide range of existing policies by expressing them as optimization problems, where the output of an optimization problem is a theoretically optimal allocation. Formulating policies in this way allows Gavel to provide an easy abstraction for incorporating performance heterogeneity into these scheduling policies. Furthermore, Gavel has a round-based scheduling mechanism to ensure that jobs receive their optimal allocation in practice. And Gavel can improve objectives such as average job completion time by up to 3.5x.

To give you an idea of how Gavel works: as before, users submit their training jobs, written in an existing framework such as PyTorch or TensorFlow, but now these are given to a throughput estimator, which is responsible for profiling these jobs on the different resource types and producing a throughput tensor that captures this information. This throughput tensor is fed into the scheduling policy, which is represented as an optimization problem that tries to satisfy some user objective. The policy produces a theoretically optimal allocation, which is given to the round-based scheduling mechanism. This mechanism decides, in a series of discrete rounds, which jobs should actually be placed on the physical resources. And once we make these placement decisions, we can observe the measured performance and use those measurements to update our estimates of the throughputs. For the remainder of this talk, I’ll first talk about heterogeneity-aware policies, then the round-based scheduling mechanism, then briefly touch on the throughput estimation component of Gavel, and wrap up with some empirical results.

So, as I said, Gavel is able to represent scheduling policies as optimization problems where the output of the optimization problem is an allocation, which we can denote using a matrix X. X specifies the fraction of time each job spends on each accelerator type between allocation recomputations. To make this concrete, let’s look at a sample allocation matrix. You’ll see that there are three different resource types, the V100, P100, and K80, as well as three different jobs. The first row of this matrix says that job 0 should spend 60% of its time on the V100 and 40% of its time on the P100. We can visualize this in the diagram here, which shows the allocation between two recomputation events. These recomputation events can be the arrival of a new job, the departure of a job, or simply periodic intervals of time.
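As a small illustration in code, here is what such an allocation matrix might look like as a numpy array. Only job 0’s row comes from the example above; the other rows and the one-GPU-per-type capacity check are assumptions added purely for illustration.

```python
import numpy as np

gpu_types = ["V100", "P100", "K80"]

# Allocation matrix X: X[m, j] = fraction of time job m spends on GPU type j
# between allocation recomputations. Row 0 matches the talk's example;
# rows 1 and 2 are made-up values for illustration.
X = np.array([
    [0.6, 0.4, 0.0],   # job 0: 60% V100, 40% P100
    [0.2, 0.6, 0.2],   # job 1 (hypothetical)
    [0.2, 0.0, 0.8],   # job 2 (hypothetical)
])

# A valid allocation gives each job at most 100% of its time in total...
assert np.all(X.sum(axis=1) <= 1.0 + 1e-9)
# ...and, assuming one GPU of each type here, hands out at most one GPU's
# worth of time per type (in general this is scaled by the number of workers).
assert np.all(X.sum(axis=0) <= 1.0 + 1e-9)
```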
So there are many different policies that we can make heterogeneity-aware. The first is LAS, or least attained service. This is a policy that tries to maximize the minimum fairness quantity, which in this case is the total amount of time each job spends computing on physical resources. There’s also LAS with weights, which is similar to LAS but allows us to assign jobs different weights in order to compare them against each other. There’s also a minimize-makespan policy, which tries to run a batch of jobs as quickly as possible. And there are many others, as we list here: for example, finish time fairness, a policy proposed in a paper published earlier this year; FIFO and shortest job first, which are classic scheduling policies; minimize cost subject to SLOs, which I referenced earlier; as well as hierarchical policies.

So now let’s talk about how we would actually represent a policy as an optimization problem. In the homogeneous case, we have policy objectives that are functions of throughput, where throughput is measured in training steps per second. In the heterogeneous case, we can instead express these policy objectives in terms of a quantity we call effective throughput. The effective throughput of a job is simply the weighted average of the job’s throughput on each resource type, where the weights are given by the allocation matrix X. In particular, the effective throughput of a model m, given an allocation matrix X, is the sum over all resource types j of the throughput of model m on resource type j, times the model’s allocation on resource type j. You can see an example throughput matrix here.

So now let’s look at one policy in particular and how we can make it heterogeneity-aware. We’ll look at the least attained service policy, which tries to equalize the amount of compute time each job receives. To express this as an optimization problem, we can try to find the allocation matrix that maximizes the minimum allocation for any given model. But if we apply this directly in the heterogeneous case, this policy leads to unequal throughput reductions, due to the heterogeneous performance behavior we observed earlier. To fix this, we can incorporate the notion of effective throughput into the policy. Now our policy tries to find the allocation that maximizes the minimum effective throughput for any given job, and we include a normalizing factor in order to compare effective throughputs across different jobs. There are many other policies that we can make heterogeneity-aware; I’m only going to discuss the LAS policy in this talk, but I encourage you to see the paper for more details.

The nice thing about expressing policies as optimization problems is that it also enables us to incorporate optimizations like space sharing and placement. Space sharing is when we pack multiple jobs onto the same GPU in order to improve GPU utilization, and placement awareness tries to consolidate distributed jobs onto a single server to minimize communication overhead. To support these optimizations, we don’t need to change the objectives in terms of effective throughput, but we may need to change the allocation matrix to account for whichever performance optimization we’re interested in. For example, with space sharing, we might need to add rows to our allocation matrix for every combination of jobs. We may also need to modify how we collect throughputs, which I’ll touch on later in this talk, but again I encourage you to see the paper for more details.
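To sketch how such a policy can be written down as an optimization problem, here is a minimal cvxpy formulation of the heterogeneity-aware max-min (LAS-style) objective using the effective-throughput definition above. The throughput numbers, the one-GPU-per-type capacity constraint, and the normalization by each job’s best single-GPU throughput are assumptions made for illustration rather than Gavel’s exact formulation.

```python
import cvxpy as cp
import numpy as np

# T[m, j]: measured throughput (steps/sec) of job m on GPU type j.
# These numbers are made up for illustration.
T = np.array([
    [40.0, 25.0, 12.0],   # job 0 on [V100, P100, K80]
    [15.0, 12.0,  5.0],   # job 1
    [90.0, 60.0, 10.0],   # job 2
])
num_jobs, num_gpu_types = T.shape
gpus_per_type = np.array([1, 1, 1])   # assumed cluster capacity per type

X = cp.Variable((num_jobs, num_gpu_types), nonneg=True)

# Effective throughput of each job: sum_j T[m, j] * X[m, j].
effective_throughput = cp.sum(cp.multiply(T, X), axis=1)

# Normalize so jobs with very different raw throughputs are comparable
# (here: each job's best single-GPU throughput, an illustrative choice).
scale = T.max(axis=1)

objective = cp.Maximize(cp.min(cp.multiply(effective_throughput, 1.0 / scale)))
constraints = [
    cp.sum(X, axis=1) <= 1,              # each job gets at most 100% of its time
    cp.sum(X, axis=0) <= gpus_per_type,  # don't exceed the GPUs of each type
]
cp.Problem(objective, constraints).solve()
print(np.round(X.value, 2))
```

Gavel’s actual policies add further structure, such as per-job weights, space-sharing rows, and a different normalization; see the paper for the precise formulations.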
So now let’s talk about the round-based scheduling mechanism. The goal here is, given a heterogeneity-aware allocation, to determine how to assign physical resources to jobs so as to respect that theoretically optimal allocation. Here we have an example allocation that incorporates both performance heterogeneity and space sharing: you’ll see there’s a row indicating that job 0+1, the two jobs packed together, should spend 100% of its time on a V100 GPU. What we want to look at is how we take this allocation and turn it into an assignment of jobs to heterogeneous cluster resources.

Gavel has a round-based scheduler that ensures jobs receive time on the accelerator types according to the computed optimal allocation X, and we can see how this works in the diagram. Time is divided into a series of discrete rounds. In the first round, job 3 is running on the P100 and job 2 is running on the K80, but in the next round we switch these and have job 2 run on the P100 and job 3 run on the K80. This allows us to respect the theoretical allocation, which says that jobs 2 and 3 should each spend half their time on the P100 and K80 GPUs.

To actually determine which jobs are assigned to which resources, we use priority scores. The priority score of a job is high when the job has received less time on a particular resource type than it was supposed to according to the theoretical allocation. Let’s look at an example. We have the example allocation matrix from earlier, but now we also have another matrix, rounds received, which tells us how much time each job has actually spent executing on each resource type. You’ll notice that job 0 was supposed to spend 60% of its time on the V100 according to the theoretical allocation, but in reality it spent 75% of its time there. To compute priority scores, we take the elementwise ratio between the theoretically optimal allocation and this rounds-received matrix, and now we see that job 0, for the next round, has highest priority on the P100. Similarly, job 1 and job 2 have highest priority on the K80 and V100, respectively. So if we look at the next round, job 0 has been placed on the P100, job 1 on the K80, and job 2 on the V100. Round durations in our experiments are six minutes; I encourage you to check out the paper for a justification of that choice. The other thing to note is that if jobs are not changing resource types, or physical resources rather, between rounds, we can issue them lease extensions to minimize preemption overhead.
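Here is a minimal numpy sketch of the priority computation just described. Job 0’s 60%-allocated versus 75%-received figures come from the talk’s example; the remaining numbers and the greedy per-GPU assignment are assumptions for illustration, not Gavel’s full mechanism.

```python
import numpy as np

gpu_types = ["V100", "P100", "K80"]

# Target allocation X and the fraction of time each job has actually
# received so far. Job 0's numbers match the talk's example; the rest
# are made up for illustration.
X = np.array([
    [0.60, 0.40, 0.00],
    [0.20, 0.60, 0.20],
    [0.20, 0.00, 0.80],
])
rounds_received = np.array([
    [0.75, 0.25, 0.00],
    [0.25, 0.70, 0.05],
    [0.00, 0.05, 0.95],
])

# Priority is high where a job has received less time than its allocation;
# jobs with zero allocation on a GPU type get zero priority there.
priorities = np.where(X > 0, X / np.maximum(rounds_received, 1e-9), 0.0)

# Greedily hand each GPU type to the job with the highest priority for it
# (a simplification of Gavel's actual round-by-round assignment).
for j, gpu in enumerate(gpu_types):
    print(f"{gpu}: job {np.argmax(priorities[:, j])}")
```

With these numbers, the next round places job 0 on the P100, job 1 on the K80, and job 2 on the V100, matching the example in the talk.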
So now let’s talk about the throughput estimation component in Gavel. There are a couple of questions we want to answer here. The first is: how well does a newly arrived job perform in isolation on each worker type? And if we’re interested in space sharing, we also want to know how well a newly arrived job co-locates with other jobs on each worker type. The second question is challenging to answer naively, because if we just measure the combination of the newly arrived job with every other active job, there are on the order of k times n different combinations to try, where k is the number of resource types and n is the number of active jobs. To do this more efficiently, we can use a matrix completion approach, originally proposed by the Paragon and Quasar papers. I’ll give you a brief overview of how this works.

When a new job arrives, instead of measuring its throughput with every other active job, we measure the throughputs of a set of reference jobs offline and produce a reference throughput matrix, which we have here as R. Then, when the new job arrives, we make a copy of R, append a new row to it for the new job, and measure only a subset of the entries in that row. We then use matrix completion to fill in the rest. Once we’ve filled out this row, we can compute a fingerprint for the newly arrived job by comparing it against every row in the reference throughput matrix and finding the row most similar to the new row we’ve appended. And once we have a fingerprint for this job, we can compute an initial estimate of its throughputs when co-located with all other active jobs, by matching its fingerprint against the fingerprint of every other active job.
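To illustrate the matrix-completion step, here is a minimal sketch using simple iterative low-rank (SVD) imputation. This is just one standard way to complete a partially observed row and is not Gavel’s actual estimator; the reference matrix, the choice of rank, and the distance-based fingerprint are all illustrative assumptions.

```python
import numpy as np

def complete_new_row(R, partial_row, rank=2, iters=200):
    """Fill in the np.nan entries of `partial_row` (the new job's partially
    measured throughputs) using the fully observed reference matrix R,
    via iterative low-rank SVD imputation."""
    M = np.vstack([R, partial_row])
    observed = ~np.isnan(M)
    filled = np.where(observed, M, np.nanmean(M))      # initialize missing entries
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # best rank-k fit
        filled = np.where(observed, M, low_rank)         # keep real measurements
    return filled[-1]

# Reference throughputs for 4 offline-profiled jobs in 5 co-location settings
# (made-up numbers).
R = np.array([
    [1.0, 0.8, 0.6, 0.9, 0.7],
    [0.5, 0.4, 0.9, 0.6, 0.8],
    [0.9, 0.7, 0.5, 0.8, 0.6],
    [0.4, 0.3, 0.8, 0.5, 0.7],
])

# New job: only two entries measured so far.
new_row = np.array([0.95, np.nan, np.nan, 0.85, np.nan])
completed = complete_new_row(R, new_row)

# Fingerprint: the reference job whose profile is closest to the new job.
fingerprint = int(np.argmin(np.linalg.norm(R - completed, axis=1)))
print(completed.round(2), "-> most similar to reference job", fingerprint)
```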
So now let’s talk about some empirical results. There are a few different questions we want to answer as part of this evaluation: whether Gavel’s policies improve objective metrics on a physical cluster; what the impact of input load is on objectives under Gavel’s policies; whether Gavel’s policy framework can support hierarchical policies; whether Gavel’s policies scale with the number of active jobs; whether Gavel’s scheduling mechanism can realize optimal allocations; and finally, the overhead of preemption in Gavel.

Starting with the first question, of whether Gavel can improve objectives on a physical cluster: the setup is a cluster with 48 total GPUs, split between 8 V100s, 16 P100s, and 24 K80s, and there are a couple of scenarios. The first looks at the least attained service policy in two regimes, a heterogeneity-agnostic version and a heterogeneity-aware version, on a trace where jobs arrive according to a Poisson arrival process; we capture the job completion times of 25 jobs after the system has reached steady state. What we find is that the heterogeneity-aware version improves the average job completion time by 1.5x. The next scenario is a minimize-makespan policy: we have a trace with 100 jobs submitted all at once, and we want to complete them as quickly as possible. Again we compare a heterogeneity-agnostic policy with a heterogeneity-aware policy; the difference here is that the heterogeneity-agnostic policy also enables ad hoc space sharing. We find that the heterogeneity-aware policy, despite not using space sharing, outperforms the heterogeneity-agnostic version by 1.2x, because it places jobs on the resource types they actually benefit the most from. The other takeaway from this slide is that we have a simulator for Gavel, so we can model different experiments without using actual physical resources, and when we compare the simulator against the physical cluster results, they match up quite closely, with no more than an 8% difference across the board.

So now let’s look at whether Gavel can enable the same heterogeneous cluster to support higher input load. The setup here is a simulated cluster with 108 total GPUs, split evenly between V100s, P100s, and K80s, and we evaluate each policy on multiple traces, where the traces differ in the Poisson arrival rate that determines the inter-arrival times between jobs. There are three different results here: the first is the heterogeneity-agnostic LAS policy, in orange we have the heterogeneity-aware policy, and in green we have the heterogeneity-aware policy with space sharing enabled. There are a couple of takeaways. The first is that with the heterogeneity-aware policies, we’re able to support a higher input job rate compared to the baseline. And for a given input job rate, we’re able to achieve up to a 3.5x better average job completion time. We can also look at a CDF of the job completion times, and we find that with the heterogeneity-aware policies, the tail of the CDF is shorter.

Now let’s look at whether Gavel can support hierarchical policies. To examine this, we have a setup with a top-level organization and three sub-entities. Each entity has a weight associated with it: entity 0 has a weight of one, entity 1 has a weight of two, and entity 2 has a weight of three. The fact that entity 1 has a weight of two means it should get twice as many resources as entity 0, and there are six jobs for each entity. We submit all the jobs for entity 0 first, then entity 1, then entity 2. What we’re showing here is the fraction of the total effective throughput achieved by each entity. As we start adding jobs for the second entity, we see that the total amount of resources entity 1 receives is twice the allocation of entity 0, which is what we expected. And once we add all the jobs from entity 2, the allocation settles into the ratio 3:2:1, which matches the weights we gave each entity. The width of the bars indicates that both the inter- and intra-entity weights are respected.

Now let’s look at how Gavel’s policies scale as we increase the number of jobs. We see that for an LAS policy and a hierarchical policy, without space sharing enabled, we’re able to scale to roughly 2,000 jobs in a minute or less. With space sharing, the performance isn’t as good, but we’re still able to support a large number of jobs in a reasonable amount of time.

Next, let’s look at whether Gavel’s scheduling mechanism can match an ideal execution, where the ideal execution is measured by giving every job exactly the amount of time it was supposed to receive on each resource type according to the optimal allocation. We can do this in simulation, and we find that with six-minute rounds, Gavel’s scheduling mechanism almost exactly matches this ideal baseline. And finally, let’s look at whether Gavel’s scheduling mechanism introduces any preemption overhead. Here we’re showing the preemption overhead for six different job types, and we find that across the board, even without lease renewals, jobs experience a preemption overhead of less than 3% with six-minute rounds, and it’s even lower with lease renewals.

So in conclusion, Gavel is a heterogeneity-aware cluster scheduler that can optimize for many high-level objectives such as fairness, makespan, and cost. Gavel formulates existing policies as optimization problems and extends these problems to be heterogeneity-aware.
And Gavel can improve metrics such as average job completion time by up to 3.5x. We’ve open-sourced our code at the following link, and I encourage you to email me with any follow-up questions or discussion. So thanks for listening, and I’m happy to take any questions.


 
About Keshav Santhanam

Stanford University

Keshav is a second-year PhD student at Stanford University advised by Professor Matei Zaharia. He's a member of the FutureData Systems research group and the Stanford DAWN group. His research is focused on building systems and infrastructure to accelerate machine learning workloads. He earned a BS in Computer Science from the University of Illinois at Urbana-Champaign (UIUC) in 2017 and an MS in Computer Science from Stanford in 2019 (dual concentration in Systems and Artificial Intelligence). At UIUC he worked with Professor Indranil Gupta in the Distributed Protocols Research Group.