Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required to train these models can result in unrealistically long training times. New methods of model parallelism such as tensor and pipeline parallelism have been proposed to address these challenges; unfortunately, naive usage leads to fundamental scaling issues at thousands of GPUs due to various reasons, e.g., expensive cross-node communication or idle periods waiting on other devices.

In this work, we show how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models we can efficiently train compared to existing systems. We discuss various implementations of pipeline parallelism and propose a novel schedule that can improve throughput by more than 10% with comparable memory footprint compared to previously-proposed approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. The composition of these techniques allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of peak; previous efforts to train similar-sized models achieve much lower throughput (36% of theoretical peak). Our code has been open-sourced at https://github.com/nvidia/megatron-lm.

Deepak Narayanan, Final-Year PhD Student, Stanford University

Deepak Narayana…: Hi everyone. I’m Deepak. I’m a final-year PhD student at Stanford, where I’m advised by Matei Zaharia. I’m excited today to talk about work that I’ve been doing at NVIDIA to train large-scale language models on GPU clusters. Machine learning has powered state-of-the-art results across a broad range of applications. As the amount of available compute and training data has increased, so too has the size of machine learning models. This graph shows the sizes of state-of-the-art models used in natural language processing. Over the last three or so years, the number of parameters in these models has increased by more than three orders of magnitude, drastically increasing the number of compute operations needed to train these models as well. In this work, we discuss various techniques needed to train models with up to a trillion parameters, which far exceed the memory capacity of a single GPU, in reasonable timeframes.

Using a combination of various techniques, we are able to efficiently run training iterations for GPT models with up to a trillion parameters using more than 3,000 A100 GPUs with extremely graceful scaling. With this high throughput, we estimate that we will be able to train a model with a trillion parameters end to end in about three months using 3072 A100 GPUs. Our results are a by-product of three things: optimized hardware with state-of-the-art GPUs and high-bandwidth interconnects between GPUs on the same and different servers; a hybrid parallelization scheme that minimizes the amount of communication across GPUs while still keeping devices busy; and various domain-specific optimizations that target both computation and communication.

Our code is open source on GitHub, and we would love for you to check it out. First, I’m going to describe the hardware setup on which we obtain these results. We run experiments on Selene, a GPU cluster with A100 GPUs. Each A100 GPU can sustain a peak device throughput of 312 teraFLOP/s. GPUs within a cluster node are connected by NVLink and NVSwitch, and GPUs across cluster nodes are connected by 200 gigabit-per-second InfiniBand cards. In aggregate, Selene is able to provide 2.8 exaFLOP/s of peak AI performance.
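As a quick sanity check on how these hardware numbers relate to the headline result, the short sketch below recomputes the fraction of peak from figures quoted in the abstract and the talk (3072 GPUs, 312 teraFLOP/s peak per A100, 502 petaFLOP/s achieved). The function and constant names are illustrative only and are not taken from the Megatron code.

```python
# Figures quoted in the abstract and the talk.
NUM_GPUS = 3072
PEAK_PER_GPU_TFLOPS = 312.0        # peak device throughput of one A100
ACHIEVED_AGGREGATE_PFLOPS = 502.0  # aggregate throughput for the 1T-parameter model


def peak_aggregate_pflops(num_gpus: int, peak_per_gpu_tflops: float) -> float:
    """Theoretical peak of all GPUs combined, in petaFLOP/s."""
    return num_gpus * peak_per_gpu_tflops / 1000.0


def fraction_of_peak(achieved_pflops: float, num_gpus: int,
                     peak_per_gpu_tflops: float) -> float:
    """Achieved aggregate throughput divided by theoretical peak."""
    return achieved_pflops / peak_aggregate_pflops(num_gpus, peak_per_gpu_tflops)


peak = peak_aggregate_pflops(NUM_GPUS, PEAK_PER_GPU_TFLOPS)
utilization = fraction_of_peak(ACHIEVED_AGGREGATE_PFLOPS, NUM_GPUS,
                               PEAK_PER_GPU_TFLOPS)
per_gpu_tflops = ACHIEVED_AGGREGATE_PFLOPS * 1000.0 / NUM_GPUS

print(f"peak aggregate: {peak:.1f} petaFLOP/s")
print(f"achieved per GPU: {per_gpu_tflops:.1f} teraFLOP/s")
print(f"fraction of peak: {utilization:.1%}")
```

The computed fraction comes out at roughly 52%, which matches the per-GPU utilization stated in the abstract.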

It turns out that even such an optimized hardware setup is not sufficient to scale to large models. In particular, the model training graph needs to be distributed among the training resources in a clever way in order to make sure that the amount of communication being performed across devices does not overwhelm end-to-end computation time. Training a deep neural network model at a high level involves finding weight parameters, W, that fit a training data set, often consisting of inputs and associated labels. A forward pass through the model generates intermediate activations, as well as a prediction. These predictions could be incorrect. In this particular instance, we predicted that this picture of a tiger is actually a lion. Errors between the predictions and the true labels are back-propagated through the model in a backward pass, generating weight gradients that are then used to update the model’s parameters. The backward pass uses both the weight parameters and intermediate activations as inputs in order to compute the weight gradients. This optimization procedure is performed in iterations, and each iteration can be parallelized within a GPU as well as across GPUs.

Each iteration of model training can be distributed across multiple GPUs in various ways. For example, the common way of distributing model training for models that fit on a single worker is to use a paradigm called data parallelism. With data parallelism, every worker has a copy of the model. The input dataset is sharded, and each worker generates a weight gradient on its subset of the data completely independently. At the end of every iteration, workers aggregate the weight gradients across the different workers in order to make sure that every worker sees a consistent version of the weights.
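The data-parallel step just described can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the Megatron implementation: the model is a single scalar weight with a squared-error loss, the "workers" are simulated in one process, and the summation over workers stands in for the gradient aggregation step. All names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, label y)


def gradient_sum(w: float, samples: List[Sample]) -> float:
    """Sum over samples of d/dw (w*x - y)^2, which is 2*x*(w*x - y)."""
    return sum(2.0 * x * (w * x - y) for x, y in samples)


def shard(samples: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Round-robin split of the dataset, one shard per worker."""
    return [samples[i::num_workers] for i in range(num_workers)]


def data_parallel_step(w: float, samples: List[Sample],
                       num_workers: int, lr: float) -> float:
    """One iteration: independent local gradients, then aggregation."""
    local_grads = [gradient_sum(w, s) for s in shard(samples, num_workers)]
    # Stand-in for the collective that aggregates gradients across workers.
    global_grad = sum(local_grads) / len(samples)
    # Every worker applies the same update, so all replicas stay consistent.
    return w - lr * global_grad


def single_worker_step(w: float, samples: List[Sample], lr: float) -> float:
    """Reference: the same update computed on one worker with all the data."""
    return w - lr * gradient_sum(w, samples) / len(samples)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w_parallel = data_parallel_step(0.0, data, num_workers=2, lr=0.01)
w_single = single_worker_step(0.0, data, lr=0.01)
print(w_parallel, w_single)
```

The two results agree up to floating-point rounding, which is the point of the aggregation step: sharding the data changes where the gradient is computed, not what the update is.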

Another way to parallelize model training is to use model parallelism, which is typically a technique used to train models that are larger than the memory capacity of a single GPU. The model parameters can be split over the available workers or devices in multiple ways. For example, in tensor model parallelism, each operator or layer in the model can be split over the available workers or devices. With inter-layer model parallelism, on the other hand, we assign each layer or operator to a different worker. In this particular example, layers one and two are assigned to worker one, and layers three and four are assigned to worker two.
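To make the two ways of splitting a model concrete, here is a small sketch in plain Python. It assumes one particular tensor-parallel split (partitioning a weight matrix by columns and concatenating the partial outputs) and a contiguous assignment of layers to workers; real systems may split operators differently, and the function names here are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(w: Matrix, num_workers: int) -> List[Matrix]:
    """Tensor model parallelism: each worker holds a block of columns of w."""
    cols = len(w[0])
    assert cols % num_workers == 0, "columns must divide evenly"
    step = cols // num_workers
    return [[row[i * step:(i + 1) * step] for row in w]
            for i in range(num_workers)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, num_workers: int) -> Matrix:
    """Each worker multiplies by its shard; outputs are concatenated."""
    partials = [matmul(x, w_shard) for w_shard in split_columns(w, num_workers)]
    # Stand-in for the collective that gathers partial activations.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def assign_layers(num_layers: int, num_workers: int) -> List[List[int]]:
    """Inter-layer model parallelism: contiguous layers per worker (1-indexed)."""
    per_worker = num_layers // num_workers
    return [list(range(i * per_worker + 1, (i + 1) * per_worker + 1))
            for i in range(num_workers)]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
print(tensor_parallel_matmul(x, w, num_workers=2) == matmul(x, w))
print(assign_layers(num_layers=4, num_workers=2))
```

With four layers and two workers, `assign_layers` reproduces the example from the talk: layers one and two on the first worker, layers three and four on the second.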

Depending on how the model training graph is split over the available devices, different communication patterns are needed. For example, with data parallelism, we need to perform all-to-all communication of the weight gradients in order to make sure that every worker sees a consistent version of the weight parameters. With tensor model parallelism, we need to perform all-to-all communication of the partial activations and gradients generated on each worker. And with inter-layer model parallelism, we need to perform point-to-point communication of the intermediate activations and gradients generated on each layer subset.

The upshot of all of this is that certain parallelism methods, such as data parallelism and tensor model parallelism, require collective communication, while others, such as inter-layer model parallelism, can get away with much cheaper point-to-point communication. As it turns out, communication is not the only axis of interest that we should compare along to reason about the performance of distributed training schemes. Even though inter-layer model parallelism can drastically reduce the amount of inter-device communication, it can also lead to device idle periods as devices wait for inputs to propagate through the other devices, consequently limiting throughput. In this timeline diagram, I’m showing per-device utilization with time on the X-axis. Blue boxes show forward passes, green boxes show backward passes, and the numbers in each box indicate unique identifiers given to particular inputs. We see that the backward pass for input one can be initiated only after the forward pass for input one completely runs through devices one through four. We observe that large parts of the timeline diagram are grey, indicating idle periods.

This form of resource utilization leads to low throughput. Pipelining is an idea that has been proposed by recent work like PipeDream and GPipe to address this issue. A batch of inputs is split into smaller microbatches, and execution is then pipelined across these microbatches. In this particular example, I’m splitting a batch consisting of four samples into four microbatches. Device one no longer has to wait for the forward and backward pass of microbatch one to complete before starting on the forward pass of microbatch two. Execution for microbatch two can thus be started as soon as the forward pass for microbatch one is completed. The optimizer can be stepped and weight parameters updated when all microbatches in a batch are completed. Note that every device is idle for a couple of time units at the start and end of a batch. We call this a pipeline flush, as devices wait for inputs to either be injected into the pipeline or subsequently drained out at the end of a batch.
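The pipelined timeline can be reproduced with a small simulation. The sketch below assumes an all-forward, then all-backward schedule with a flush at the end of the batch, uniform per-microbatch forward and backward times on every device, and zero communication cost; these are simplifying assumptions for illustration, and the function name is hypothetical.

```python
from typing import Tuple


def simulate_pipeline(p: int, m: int, t_f: float, t_b: float) -> Tuple[float, float]:
    """Return (time for one batch, idle time per device).

    p: number of pipeline devices, m: microbatches per batch,
    t_f / t_b: forward / backward time of one microbatch on one device.
    """
    # fwd_done[d][mb]: time at which device d finishes the forward pass of mb.
    fwd_done = [[0.0] * m for _ in range(p)]
    for d in range(p):
        for mb in range(m):
            input_ready = fwd_done[d - 1][mb] if d > 0 else 0.0
            device_free = fwd_done[d][mb - 1] if mb > 0 else 0.0
            fwd_done[d][mb] = max(input_ready, device_free) + t_f

    # Backward passes flow from the last device back to the first.
    bwd_done = [[0.0] * m for _ in range(p)]
    for d in reversed(range(p)):
        for mb in range(m):
            grad_ready = bwd_done[d + 1][mb] if d < p - 1 else fwd_done[d][mb]
            # A device starts backward work only after all its forward passes.
            device_free = bwd_done[d][mb - 1] if mb > 0 else fwd_done[d][m - 1]
            bwd_done[d][mb] = max(grad_ready, device_free) + t_b

    batch_time = bwd_done[0][m - 1]   # device 0 finishes last
    busy_time = m * (t_f + t_b)       # useful work per device
    return batch_time, batch_time - busy_time


batch_time, idle_time = simulate_pipeline(p=4, m=4, t_f=1.0, t_b=2.0)
print(batch_time, idle_time)
```

With four devices and four microbatches, each device does 12 time units of useful work inside a 21-unit batch, so 9 units per device are spent idle during the flush.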

We can now exactly quantify the amount of time different devices spend idling. We call this the pipeline bubble, and it’s equal to P minus one forward passes and P minus one backward passes, where P is the depth of the pipeline. In this particular example, P is equal to four. The reason why the pipeline bubble is equal to P minus one forward passes and P minus one backward passes is that microbatches first need to be injected into the pipeline and then drained. If T sub f is the time taken for the forward pass of a single microbatch and T sub b is the time taken for the backward pass of a single microbatch, then if a batch has M microbatches, the ideal amount of time taken by these M microbatches is just going to be M times Tf plus Tb.

The total time spent in the pipeline bubble is going to be P minus one times Tf plus Tb. And thus the fraction of ideal total time that’s spent in the pipeline bubble is just going to be the ratio of those two quantities, which is equal to P minus one divided by M. Thus we see that at large batch sizes, where M, the total number of microbatches in a batch, is large, the fraction of ideal time spent in the pipeline bubble is going to be small. We can see these results in action. I’m going to show the results of a weak-scaling experiment where the size of the model is proportionally increased as the pipeline-parallel size increases. I’m going to show results for two different batch sizes. The Y-axis shows per-GPU achieved throughput. If you have perfect linear scaling, you would expect this line to stay pretty flat, horizontal, and parallel to the X-axis.
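For reference, the bubble-fraction argument spoken above can be written out with the same symbols (P for pipeline depth, M for the number of microbatches in a batch, T_f and T_b for the forward and backward time of one microbatch):

```latex
\begin{align*}
  T_{\text{ideal}}  &= M \,(T_f + T_b) \\
  T_{\text{bubble}} &= (P - 1)\,(T_f + T_b) \\
  \frac{T_{\text{bubble}}}{T_{\text{ideal}}}
    &= \frac{(P - 1)\,(T_f + T_b)}{M\,(T_f + T_b)}
     = \frac{P - 1}{M}
\end{align*}
```

As an illustrative example (the value of M here is hypothetical), with P = 4 the bubble costs 3/4 of the ideal time when M = 4, but only 3/32, or about 9%, when M = 32.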

We see that as we increase the size of a GPT model from 16.2 billion parameters to 122 billion parameters, the throughput per GPU decreases. This decrease in throughput is particularly stark for the small batch size, where the effect of the pipeline bubble is much more pronounced. On the other hand, we see that for the larger batch size of 128, throughput remains fairly flat, indicating that pipeline parallelism scales fairly well. We thus see that different modes of model parallelism have different trade-offs. While tensor model parallelism features expensive collective communication, pipeline model parallelism features much cheaper point-to-point communication. Tensor parallelism, on the other hand, always keeps devices active, whereas pipeline parallelism can result in device idle periods, or pipeline bubbles.

These trade-offs get much more complicated when we want to compose tensor and pipeline model parallelism. Additionally, we can add data parallelism into the mix, and this yields yet more trade-offs. It turns out that we can schedule the forward and backward passes of various microbatches, as well as the computations corresponding to different layers, in other ways as well, leading to other trade-offs. In particular, we can split a model into a larger number of stages than we have devices. So instead of splitting a model into four stages and then housing those four stages on four different devices, we can instead split a model into eight stages and then map those eight stages to four devices in an interleaved fashion.

In the timeline diagram on the bottom of the slide, we show such an interleaved scheme. The dark colors show the first stage on a particular device and the light colors show the second stage. Blue, as before, shows forward passes, while green shows backward passes. We see that with this interleaved scheme, the total amount of time spent in idle periods at the start and end of the batch is much smaller. This is because the total amount of time spent in the forward and backward pass for each stage is now cut in half. Consequently, the pipeline bubble is also reduced in size by 2x. However, this interleaved schedule does not come for free. Since every device now has two stages, and control needs to pass through these two stages at different points in time, the amount of communication needed with this interleaved schedule is also going to be doubled. And thus we see that this new interleaved schedule presents a trade-off: while it reduces the pipeline bubble size, it also increases the amount of communication.
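The interleaved assignment and its trade-off can be sketched as follows. The talk covers the case of two stages per device (bubble halved, communication doubled); the generalization to `v` stages per device, with the bubble shrinking by a factor of `v` and communication growing by the same factor, is an assumption made for this sketch, as are the round-robin mapping and all function names.

```python
from typing import List


def stage_to_device(num_stages: int, num_devices: int) -> List[List[int]]:
    """Interleaved (round-robin) mapping: stage s is placed on device s % num_devices."""
    mapping: List[List[int]] = [[] for _ in range(num_devices)]
    for stage in range(num_stages):
        mapping[stage % num_devices].append(stage)
    return mapping


def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Bubble time over ideal time with p devices, m microbatches, v stages per device.

    v = 1 is the regular schedule, (p - 1) / m. Each extra stage per device
    is assumed to shrink the per-stage compute time, and so the bubble, by 1/v.
    """
    return (p - 1) / (v * m)


def relative_communication(v: int) -> int:
    """Point-to-point communication volume relative to the regular schedule (assumed v times)."""
    return v


print(stage_to_device(num_stages=8, num_devices=4))
print(bubble_fraction(p=4, m=4), bubble_fraction(p=4, m=4, v=2))
print(relative_communication(2))
```

For eight stages on four devices, each device holds two non-adjacent stages, which is why activations have to cross device boundaries twice as often as in the regular schedule.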

We can now quantitatively see the impact of this interleaved schedule. As before, I’m showing per-GPU throughput on the Y-axis. On the X-axis, I’m now showing the batch size. All throughputs are shown for a 175-billion-parameter model using 96 A100 GPUs. We see that at small batch sizes, where the regular non-interleaved schedule suffers from large pipeline bubble sizes, the interleaved schedule improves throughput by as much as 1.5x. As we increase the batch size, the impact of the pipeline bubble for the regular non-interleaved schedule decreases, and thus the gap between the two schedules closes. We thus see that this configuration space of degrees of pipeline, tensor, and data parallelism, as well as the pipeline schedule, can greatly impact the amount of communication, the size of the pipeline bubble, the memory footprint, and consequently the performance of distributed configurations that we see in practice. Other hyperparameters, such as the global batch size and the microbatch size, also greatly impact the performance we observe. For more details on what the various trade-offs are for these different configuration parameters, please look at our preprint on arXiv.

Finally, we found that we also needed to implement various domain-specific optimizations to target both the computation and communication pipelines of our distributed implementation. To obtain good performance on the latest hardware, we found that we needed to optimize the computation graph by manually fusing certain memory-bound kernels, as well as using PyTorch JIT functionality to reduce the number of memory accesses in an automated fashion for various operators in our model training graph. We also found that when composing pipeline and tensor model parallelism, certain types of communication proved to be redundant, and we can implement an optimization that we call the scatter/gather optimization to reduce the number of bytes that need to be sent between devices on different servers.
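One way to picture the redundancy is sketched below. It assumes that every tensor-parallel rank on the sending server holds an identical copy of the activation, that each rank can therefore send only its own equal-sized chunk across servers, and that the receiving ranks rebuild the full tensor with a gather over fast intra-server links. The talk does not spell out this mechanism, so treat it as an illustrative assumption; lists stand in for tensors and all names are hypothetical.

```python
from typing import List, Tuple

Tensor = List[int]  # a flat list stands in for an activation tensor


def send_without_optimization(activation: Tensor, t: int) -> Tuple[List[Tensor], int]:
    """Each of the t tensor-parallel ranks sends the full (replicated) activation."""
    sent = [list(activation) for _ in range(t)]
    received = sent  # every receiving rank gets a full copy directly
    return received, sum(len(s) for s in sent)


def send_with_scatter_gather(activation: Tensor, t: int) -> Tuple[List[Tensor], int]:
    """Each rank sends one chunk across servers; receivers gather the chunks locally."""
    n = len(activation)
    assert n % t == 0, "activation must split evenly across ranks"
    chunk = n // t
    # Scatter: rank r sends only its own slice over the inter-server link.
    sent = [activation[r * chunk:(r + 1) * chunk] for r in range(t)]
    # Gather on the receiving server: every rank reassembles the full tensor.
    full = [x for piece in sent for x in piece]
    received = [list(full) for _ in range(t)]
    return received, sum(len(s) for s in sent)


activation = list(range(16))
plain_recv, plain_elems = send_without_optimization(activation, t=8)
opt_recv, opt_elems = send_with_scatter_gather(activation, t=8)
print(plain_recv == opt_recv, plain_elems, opt_elems)
```

In this toy setting both variants deliver the same tensors to the receiving ranks, while the number of elements crossing the inter-server link drops by a factor equal to the tensor-parallel size.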

As before, I’m going to show per-GPU throughput on the Y-axis and batch size on the X-axis. This is going to be for the GPT-3 model with 175 billion parameters using 96 A100 GPUs. We see that across the spectrum of batch sizes, the scatter/gather communication optimization can improve throughput by 10% to 15%. Our implementation is built on top of the Megatron training system for transformer-based language models. It is built on top of PyTorch and supports various transformer-based models like GPT and BERT. While a lot of the experiments that we presented in this talk were on large GPT models, we found that we were able to obtain good performance on a large number of smaller models as well, for example BERT-Base and BERT-Large as well as GPT-2. Our implementation also obtains good performance at small scale factors using a small number of GPUs.

Our implementation supports different types of model parallelism, such as tensor and pipeline model parallelism, as well as data parallelism, and different types of precision as well, for example fp16 as well as fp32. The repository’s main README file has instructions on how to train these models, and the main driver files are pretrain_gpt.py and pretrain_bert.py. With that, I want to conclude. In this talk, I showed how we can compose pipeline and tensor model parallelism with data parallelism to train transformer models with up to a trillion parameters at scale with thousands of GPUs with extremely high efficiency. The net aggregate throughput of our training was 502 petaFLOP/s, which is 52% of theoretical peak. We have a preprint discussing these various insights, as well as more experiments, on arXiv, and I encourage you to check that out. With that, I’m happy to answer questions.

Deepak Narayanan is a final-year PhD student at Stanford University advised by Prof. Matei Zaharia. He is interested in designing and building software to improve the runtime performance and efficienc...
