How (Not) To Scale Deep Learning in 6 Easy Steps

Deep learning sometimes seems like sorcery. Its state-of-the-art applications are at times delightful and at times disturbing. It’s no wonder that companies are eager to apply deep learning to more prosaic business problems like better churn prediction, image curation, chatbots, time series analysis and more. This talk won’t examine how to tune a deep learning architecture for accuracy. Instead, it walks through basic steps to avoid common performance pitfalls in training, and then the right steps, in order, to scale up by applying more complex tooling and more hardware. Hopefully, you will find your modeling job can move along much faster without reaching immediately for a cluster of extra GPUs.


Video Transcript

– Hello, everyone, and welcome to Summit. This is Sean Owen. I am a principal solutions architect here at Databricks, and I have the pleasure of focusing on machine learning and data science when it comes to getting those things to work on Databricks, and sometimes work well and scale up. So we have a lotta conversations with a lotta customers about getting these types of jobs to really work well on Databricks. We’ve learned a lot with our customers along the way. And today I wanna share with you some insights we’ve gleaned about how to make deep learning work pretty well, what to do, and maybe what not to do when you’re trying to get these things to scale up to large-scale data, as I assume we’re all trying to do at some point.

So it goes without saying, deep learning just seems like magic sometimes. It does so many amazing things. It can take a picture of your dog, look at a bunch of paintings by Kandinsky, and produce a painting of your dog in the style of Kandinsky. This is style transfer. Or maybe, we’ve learned that it can help drive cars autonomously, which also seems pretty crazy. It can learn how to identify people and stoplights and do so well enough to drive down the road safely, which is amazing. And maybe on the slightly more unsettling end, we’ve learned that deep learning can transfer, literally, the face of one person onto another while speaking. This here’s an example: getting the Mona Lisa to talk in the style of someone just talking in a room. These are deepfakes. So you know, it’s really no surprise that everyone’s reaching for deep learning, because it seems like such powerful magic, and a lot of the tools you need are free and open-source, frameworks like Keras and TensorFlow and PyTorch. A lot of the models, and the processes for how to build them, are also open-source. You can download them off the internet and just try to run them, and that seems so amazing that everyone’s trying it these days, right?

And you know, here’s a quote: “My deep learning, it just trains so fast, I don’t have time to even think or go get a coffee. I just keep iterating at the speed of thought.” This was said by, really, nobody ever. I think if you’ve ever tried deep learning, one of the first things you realize is that it’s extremely computationally intensive. It takes a lot of hardware. It can take a lot of time. And even if you’re able to rent this hardware in the Cloud, it can get expensive, too. So you know, has this ever happened to you? You’re training a large model, you’re excited to see what happens, and you’re stuck staring at output like this from a Keras model, and it’s taking hours per epoch, and you know that it’s gonna be well into next week before the model finishes.

Has This Happened to You?

This is depressing. You know, you are here. You’re probably checking this terminal every morning over your coffee, hoping it hasn’t failed. So this is a common problem, right? And I think our instinct, as with lots of things in the Cloud, is to maybe throw some more hardware at it. So you figure out you don’t need your CPU, or a big CPU, you need a big GPU. And you run the model on that, and it’s still not quite fast enough. So you reach for another GPU, and maybe another GPU, maybe another GPU, and you’re spending more and more, and it’s getting a little faster, but not nearly as fast as you might expect. It feels like something’s wrong. And indeed, something probably is wrong.

What’s Wrong Then?

From customer experience, I can say there are probably two main reasons it’s going wrong. Number one, category A, is that you’re just doing it a bit wrong. There’s something slightly funny about how you’ve set up the training process that’s causing it to be much slower than it needs to be. And number two, you’re throwing a lot of expensive hardware at it, but you need to do something a little bit differently to really take advantage of all that hardware you’re renting and make the training go as fast as possible. I’m gonna explore maybe three points in each of these categories.

Today: Tuning Not Architecture

Now, I wanna be clear: today, we are not gonna talk about tuning the actual architecture of these models and choosing the right architecture. That’s its own talk, I’m sure, or many, many talks. Today, I just wanna talk about tuning an existing architecture for performance, getting it to run as fast as it really should, and making sure you’re not needlessly wasting time and money while training your neural net.

So let’s start with a simple problem to motivate and show off some of the points I’d like to make and the knowledge I’d like to share.

Caltech 256 Image Classification

So for these simple examples, we’re just gonna take a simple, multi-class image classification problem built on the Caltech-256 dataset. It’s got about 30,000 images in, strangely enough, 257 categories. And we’re talking about tens of gigabytes of data here. It’s not a huge dataset, but it’s certainly not trivial.

We are going to apply these models using Databricks, but everything I’m gonna show here really works equally well outside Databricks as well. I’m gonna use TensorFlow, TensorFlow’s Keras, to set up this model, and we’re going to start with a pretty simple transfer learning setup: taking an existing image classification network, like Xception, simply putting a Dense layer, a fully connected layer, on top of that with maybe some Dropout, and using that to train an image classifier, and see how we go from maybe an initial slow model to one that’s much faster and much more scalable. So I will say that the full code listings will be available afterwards. You don’t have to take notes, and there’s a little more to it than shown here, but not much more.

This code listing just shows the preliminaries. As I mentioned, we need to do a little bit of pre-processing to get these images into a standard size. So we will define a function to read the image, crop it, and resize it with some borders down to the standard input size that Xception wants, which is 299 by 299. And you can do all that with Spark. So here is a bit of Spark code to read the Caltech-256 images in a distributed way, apply this transformation, and then write the raw bytes of the resulting images not as individual files but as Parquet files, and we’ll see why in a minute (a rough sketch of this step appears below). But this is just the preliminary set-up here before we start training the model.

So step one, tip one is, well, it’s sort of use a GPU, but it’s almost a point zero, because really for any non-trivial model of non-trivial size with a non-trivial amount of data, you’re probably gonna get enough value out of a GPU to justify using it. And as we know, you can easily rent these in the Cloud these days, so there’s really not a strong reason not to go ahead and use one. That said, I would say start small. You don’t need to reach for the biggest, most expensive GPU or multiple GPUs initially. Start with something simple, like maybe 1 x K80 GPU. These are available on all major Cloud providers. I’d also say work in memory if you can. So if your dataset fits in memory, don’t make it more complicated. Just load it into memory. That’ll be fastest, of course. And if that means maybe taking a sample of your data at first to fit it into memory for the early iterations, that’s probably best too. No need to spend time waiting and training on your whole dataset in these early iterative stages when you’re still figuring out how the model is working.
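For reference, here is a rough sketch of that Spark preprocessing step. It is illustrative, not the talk’s exact listing: the storage paths, the glob filter, and the label-extraction regex are assumptions about how the Caltech-256 files are laid out.

```python
# A minimal sketch, assuming the Caltech-256 JPEGs live under a hypothetical
# /caltech256/256_ObjectCategories/ path; adjust paths and regex for your layout.
import io
from PIL import Image, ImageOps
from pyspark.sql.functions import udf, regexp_extract, col
from pyspark.sql.types import BinaryType

IMG_SIZE = 299  # the standard input size Xception expects

@udf(BinaryType())
def to_standard_jpeg(content):
    # Crop/resize the raw image bytes to 299x299 and re-encode as JPEG bytes
    img = Image.open(io.BytesIO(bytes(content))).convert("RGB")
    img = ImageOps.fit(img, (IMG_SIZE, IMG_SIZE))
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return buf.getvalue()

raw = (spark.read.format("binaryFile")          # `spark` assumed from a notebook session
       .option("pathGlobFilter", "*.jpg")
       .option("recursiveFileLookup", "true")
       .load("/caltech256/256_ObjectCategories/"))

# Caltech-256 folders are named like "001.ak47"; pull the numeric prefix out as the label
prepped = raw.select(
    to_standard_jpeg(col("content")).alias("image"),
    regexp_extract(col("path"), r"/(\d+)\.[^/]+/[^/]+$", 1).cast("int").alias("label"))

# Write the preprocessed bytes as Parquet rather than as thousands of small image files
prepped.write.mode("overwrite").parquet("/caltech256/parquet")
```

Writing the preprocessed images as a handful of Parquet files, rather than tens of thousands of small JPEGs, is what makes the streaming reads later in the talk efficient.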

This is maybe the most simple model you could build. As promised, we load Xception, we add a Dense layer on top of it in Keras, compile the model, fit it, and see what happens. At first glance, it looks like we got a pretty good result from this simplistic model: about 91.7 percent accuracy after 60 epochs, with epochs taking 65 seconds each. That’s a little over an hour. Unfortunately, that result is not quite true, because when we evaluate on the held-out test data, we see the accuracy is only maybe 75.9 percent. Classic overfitting. You may have seen this coming immediately. But it’s a little more insidious than that. It’s not just that we’ve overfit and the model is not going to generalize as well as we think; we also wasted a bunch of time and money getting there.
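Here is roughly what that first, naive model might look like. This is a sketch, not the talk’s exact listing: the optimizer choice, the Dropout rate, and the X_train/y_train arrays (the in-memory sample, assumed already resized and run through Xception’s preprocess_input) are illustrative assumptions.

```python
# A minimal sketch of the naive transfer-learning model: frozen Xception base plus a
# small Dropout + Dense head, trained for an arbitrary 60 epochs with batch size 2.
import tensorflow as tf
from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model

NUM_CLASSES = 257

base = Xception(include_top=False, weights="imagenet",
                input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # transfer learning: only the new head is trained

head = Dropout(0.5)(base.output)
outputs = Dense(NUM_CLASSES, activation="softmax")(head)
model = Model(inputs=base.input, outputs=outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# X_train/y_train and X_test/y_test are the in-memory sample, assumed to be
# preprocessed with tf.keras.applications.xception.preprocess_input
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=60, batch_size=2)
```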

We need to solve this problem, and we can do much better by avoiding all this needless work that we were doing just to make the model overfit. That leads to point two: use early stopping. Very important. I almost can’t think of a context in which you wouldn’t wanna use early stopping. What is it? Early stopping is an idea that you can easily access in things like Keras or PyTorch as a simple callback.

Use Early Stopping

The idea is we wanna stop training when the model’s not getting any better. We judge that on the held-out validation set, of course. So during training, after each epoch, we can stop and look at how good the model is according to that held-out validation set, and simply stop when it hasn’t improved for a couple of epochs. Now, that does two good things for us. Number one, we stop much earlier, so training finishes faster. And number two, we might get a slightly better model out, because we have avoided some of that overfitting. So in the second example, all we’ve done is add an extra callback from Keras that cuts off the training, in this case, after only about 14 epochs, not the 60 we’d arbitrarily chosen in the first instance. That means the whole training time came down to 18 minutes. In this case, we got roughly the same held-out accuracy. In other cases, you might find that the model’s actually a little better at this phase. One thing to note: do set restore_best_weights=True on the Keras callback to make sure the final model you get is that best model, not the last one it had after the patience period, when it was starting to get a little bit worse. So that’s already a pretty big win from one line of code, and hopefully you’re already doing that, but I certainly have seen many people not bother with this step and simply train for a hundred epochs. And one comment I’d like to make here is that epochs end up being a little bit arbitrary. As we train deep learning models, we’re really going to keep pushing batches of data through the model over and over and over until it’s just not getting any better. So how many batches make up an epoch is a little arbitrary, and I don’t think we should focus so much on the number of epochs. Epochs still matter as, like, a unit of training, perhaps, but they’re not really a parameter to tune in themselves.
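To make that concrete, here is a minimal sketch of the early stopping change. The patience value and the rest of the fit() arguments are illustrative, not the talk’s exact settings.

```python
# A minimal sketch: the only change from the naive run is the EarlyStopping callback.
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor="val_loss",          # judge improvement on the held-out validation set
    patience=3,                  # stop after a few epochs with no improvement
    restore_best_weights=True)   # keep the best model seen, not the last one

model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=60, batch_size=2,
          callbacks=[early_stopping])
```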

Right. Now, at this point, you might also take a look at your GPU utilization. You can do this in Spark via its metrics API in here.

GPU Utilization?

This is a screenshot of the Ganglia metrics built into Databricks. This is not bad. This small K80 GPU was about 80 or 90 percent utilized, which is pretty good. It’s better than 70, better than zero. But of course, what’s cooler than 80 percent? A hundred percent. We’d like to get all the value we can out of this relatively expensive hardware we’re renting in the Cloud. So the next thing to think about is how you can take advantage, usefully, of the extra capacity that’s left idle in the GPU.

This brings us to point three: try to max out that GPU, maybe by cranking up or modifying the batch size and the learning rate, too.

Max Out GPU with Larger Batches

So what do we mean there? You may have noticed, if you looked carefully at that code listing, that we were using a batch size of just two images, which is pretty small. Only two images at a time are going to the GPU. It turns out the GPU could have easily handled more than that in parallel. And so it probably is useful to increase that batch size, giving it much more work to do per batch, to really fully utilize the wide parallelism of these GPUs. Here in this example, we’re gonna turn the batch size up to 16, which feels a lot more reasonable. Now, the point isn’t just to make the GPUs busier. We wanna make them busier, but get something out of it. We did get something out of it in speed, because we cranked through the data in bigger batches. Obviously, we used the GPU a little better, so the epochs complete more quickly. But it also means that our estimates of the gradient resulting from these batches are a bit better, a bit more accurate. This is true only up to a point, but certainly at this point, I think 16 images are gonna give us better gradient updates than two. Because we do that, we can probably think of increasing the learning rate as well: taking bigger steps through the search space, but with more confidence, because we’re basing them on bigger batches. That helps us because we might jump forward more quickly to that final point where the model converges. As a side note, I did pick up a little rule of thumb from a research paper: if you increase the batch size by a factor of about N, you might increase the learning rate by a factor of about the square root of N. But it really depends, and both of these things can be set too high, so you have to watch how they affect the training. In this case, though, the benefit was pretty darn good. The training completed in fewer epochs and in just about nine minutes, for about the same accuracy, and that’s, again, because we were able to go through the epochs more quickly, but also because we were able to get where we were going in fewer epochs and terminate the training.
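A minimal sketch of those two changes, batch size and learning rate, might look like the following. The base learning rate and the sqrt scaling are an illustrative application of the rule of thumb, not tuned values.

```python
# Crank up the batch size, and scale the learning rate with it using the rough
# sqrt(N) rule of thumb; both can be set too high, so watch the training curves.
import math
import tensorflow as tf

old_batch_size, new_batch_size = 2, 16
base_lr = 0.001

scale = new_batch_size / old_batch_size           # N = 8
new_lr = base_lr * math.sqrt(scale)               # ~0.0028

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=new_lr),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=60, batch_size=new_batch_size,
          callbacks=[early_stopping])
```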

So again, another small bit of tuning can add a lot. We’ve gone from 65 minutes down to nine minutes, just with a few tweaks, for about the same accuracy, maybe even slightly better. But can we do better in even a simple example like this? At this point, I’m gonna switch gears and talk not so much about easy fixes but about ways to take advantage of more data and scale up to use the hardware that you’re gonna have to throw at the problem when the dataset gets bigger. One of the first things we’re gonna talk about is how to deal with large datasets that will not fit in memory, and one of the key tools you may wanna look at to help with this is a tool called Petastorm.

So first, what’s Petastorm? Well, it’s an open-source project from Uber, and to maybe oversimplify, it’s a Parquet-to-deep-learning-dataset streaming framework. It helps get data off distributed storage, in a structured way, from a Parquet file into a form that deep learning frameworks like PyTorch and TensorFlow and Keras can use, and do that efficiently with some careful access of the data, taking advantage of Parquet and local caching and things like that. The nice thing is that it integrates with most deep learning frameworks, and it plays nice with Apache Spark, too, which is good if you’re using Spark for your ETL and your data preparation as well. There’s no logo, but here’s the GitHub site for you.

Right, so what problem is Petastorm really solving for us? Well, you’ll recall that we started out on this problem with just a 10 percent sample of the data. The full dataset, maybe 10 or 20 gigabytes of image data, probably could have fit into memory, but let’s pretend it didn’t, because this could easily have been a problem where we had hundreds of gigabytes or terabytes of data, and it’s gonna be difficult or impossible to fit all of that into memory. So what we’re going to need to do instead is read the data in an efficient and principled way off distributed storage and feed it into our training framework. Now, that’s easy enough. The risk is that it’s going to slow down the training, not just because we’re reading more data but because we’re reading it remotely, and that can introduce bottlenecks and so on. Petastorm helps remove some of those bottlenecks, but as you’ll see, it’s still gonna be a persistent issue going forward.

So in this setup, the core code is not that different, and it’s still the same core training loop, but we’ve instead made some calls to Petastorm, or some wrapper functions around Petastorm, that load the data at a certain path as a Dataset, a TensorFlow object that works here like an infinite iterator over the data. So this fit process is going to use the data fed to it by Petastorm to train over and over. And you’ll notice that because this is an infinite iterator over the data, there’s no real notion of one epoch of data anymore. We have to tell it how many steps constitute an epoch. I think that just highlights how epochs are a little bit arbitrary. They’re useful as a unit of comparison, but maybe not as the amount of training you wish to ask for. So in this case, if you run this, number one, it goes slower. We’re finishing epochs in more like 680 seconds. But of course, we were using 10 times as much data, so we would expect it to be slower.
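Here is a rough sketch of what those Petastorm wrapper calls might look like, using make_batch_reader against the Parquet files written earlier. The file:// URL, the column names, the step counts, and the validation placeholders are assumptions, not the talk’s exact code.

```python
# A rough sketch: stream the preprocessed Parquet data into a tf.data Dataset via
# Petastorm, and train on it as an effectively infinite iterator.
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

BATCH_SIZE = 16
STEPS_PER_EPOCH = 1500   # illustrative: roughly (number of training rows) // BATCH_SIZE
VAL_STEPS = 150          # illustrative

def decode(image_bytes, label):
    # Decode the stored JPEG bytes and apply Xception's preprocessing
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.keras.applications.xception.preprocess_input(tf.cast(image, tf.float32))
    return image, label

# num_epochs=None makes the reader loop forever, so we must say how many steps
# constitute an "epoch" rather than relying on a natural end of the data.
with make_batch_reader("file:///dbfs/caltech256/parquet", num_epochs=None) as reader:
    dataset = (make_petastorm_dataset(reader)
               .unbatch()
               .map(lambda row: decode(row.image, row.label))
               .batch(BATCH_SIZE))

    model.fit(dataset,
              steps_per_epoch=STEPS_PER_EPOCH,
              epochs=30,
              validation_data=val_dataset,      # assumed: a held-out split built the same way
              validation_steps=VAL_STEPS,
              callbacks=[early_stopping])
```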
In practice, it’s a little more than 10 times slower, because there’s still some IO overhead and a little more cost to using this remote data, but it’s not much worse. On the upside, because we trained this model on much more data, we got an uplift in accuracy from about 76 percent to about 83 percent. So that’s good, and maybe that alone is worth paying the cost of roughly 11 times more GPU hours to get that 7 percent boost in accuracy. Now, if you look at the GPU utilization in this training run, it’s good.

GPU Utilization?

It peaks up to around a hundred percent, but it’s more spiky. Some of this is inevitable, but if you do see in your training process that your GPU utilization is heavily spiking, that typically means you’ve got some bottleneck between batches that’s slowing down the training process, and that is often IO. So if you’re seeing that problem, and you’re manually loading data off distributed storage, you might look at something like Petastorm, which can help iron out some of those IO bottlenecks. To some degree, though, this is going to be a little bit inevitable.

Now, I suppose one intermediate point here is, once you’ve maxed out a small GPU, try a bigger GPU if it’s still not going fast enough or you want it to go faster. Certainly, most Clouds have V100 GPUs that are much bigger than the K80s. So that might be your next step. That would be pretty much the same story, just with a bigger GPU. You would tune up the batch size to take advantage of the parallelism, maybe using the bigger memory to fit bigger batches, and so on. But at some point, even one large GPU may not be enough for you. Maybe you have a big enough problem, big enough data, a big enough network, that you wanna throw more hardware at it. You wanna throw multiple GPUs at it. There is a straightforward but not quite scalable way to do that, and there’s a slightly more involved but more efficient way to scale called Horovod. So I wanna talk to you about both briefly.

Use Multiple GPUS

Number one, if you’re a Keras user, you may know that it has a pretty simple way to wrap up a model, a wrapper that causes the training to split up across GPUs. It’s called multi_gpu_model (a rough sketch of this wrapper appears below). If we’ve got a batch size of 16 here, and we have, for example, an eight-GPU machine, it’s gonna chop that batch up into chunks of two, pass one to each of the GPUs, have each train on its chunk, and then combine their gradients to proceed. Now, this obviously gets you some parallelism, but if you rerun this as is, just like this, you’ll see that the whole training time comes down from about 630 seconds to maybe 270 or so. That’s good. We definitely got a speedup there, but it’s nothing like an eight-times speedup. There are a couple of reasons for that. One of the reasons is that the effective batch size each GPU is processing is too small again. We’re going back to where we started here, and each one’s processing too little data to really keep it fully utilized. But the other problem is that the way this parallelizes the batch across GPUs is not necessarily the most efficient way. It’s a simplistic way, and you’ll find the GPUs end up waiting a lot on each other to communicate and share those gradients, even when they’re all on the same machine. It seems like we should be able to expect more and do better than this. So even if we crank up the batch size to 256, so that each of them is processing a reasonably large chunk again, you’d find that this still isn’t quite achieving an eight-times speedup, and it feels like we should be able to, you know? We wanna fully utilize these eight GPUs.

So if you need to do that, you probably wanna reach for a tool called Horovod. Horovod is another open-source project from Uber. It’s a distributed deep learning optimizer. It really drops in on top of most existing code in PyTorch or Keras and swaps out the optimizer for one that uses an algorithmically more efficient means of trading those partial gradients amongst devices, and it can take advantage of some more specialized hardware under the hood to make that gradient sharing quite fast. It can help speed up multiple GPUs on one machine or, as we’ll see at the end, multiple GPUs across multiple machines. So if you invest a little more work into using Horovod in your training code, you should be able to get closer to that linear speedup.
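Before looking at the Horovod version, here is the multi_gpu_model sketch referenced above. It assumes an older TensorFlow where tf.keras.utils.multi_gpu_model is still available (it has since been removed in favor of tf.distribute.MirroredStrategy), and it reuses model, dataset, and the step-count placeholders from the earlier sketches, so treat it as illustrative of the idea rather than current best practice.

```python
# A minimal sketch of the simple data-parallel wrapper: each step's batch is split
# into per-GPU chunks, and the gradients are combined before the weights update.
from tensorflow.keras.utils import multi_gpu_model  # removed in newer TF releases

parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=new_lr),
                       loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# With a total batch size of 16, each of the 8 GPUs only sees 2 images per step;
# crank the batch size (and learning rate) up so each GPU gets a meaningful chunk.
parallel_model.fit(dataset,
                   steps_per_epoch=STEPS_PER_EPOCH, epochs=30,
                   validation_data=val_dataset, validation_steps=VAL_STEPS,
                   callbacks=[early_stopping])
```

On current TensorFlow versions, the rough equivalent is to build and compile the model inside a tf.distribute.MirroredStrategy().scope().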

And here’s what the Horovod version looks like. It’s a little hard to see, I think, but the core code is the same, the Petastorm wrapper is the same, and we’ve gotta wrap this up in a function that does a few things: it initializes Horovod, replaces the optimizer we’re using with the wrapper from Horovod, and does a few other things, like making sure that some callbacks only happen on one of the workers, not all of them. But it’s otherwise pretty simple. Really, the only hard part here is running Horovod. But in Databricks, at least, we have a utility called HorovodRunner that will run Horovod for you. It’ll launch the workers, set them up with all their MPI configuration and drivers, and run that as a Spark batch job that uses Spark’s barrier mode to make sure all these workers schedule together and live and die together. So thankfully, if you wrap it up, this utility code will just do the rest of the work for you. And in this case, we ask it to run this training code on eight GPUs that are all on the same machine. If you do that, you do get something more like an eight-times speedup. Now, this whole thing finishes in 12 minutes, and that’s about as fast as we were going when we had one GPU and a 10 percent sample. Now, sure, we’re using eight GPUs, but we’re on a hundred percent of the data, and we’ve arrived again, in about the same time, at about 83 percent accuracy. So that’s pretty good.

And I think this is probably where most people would get to. Often you’ll find that your training job is big enough to keep a bunch of GPUs on one machine busy, but you don’t need more than the 16 or so GPUs currently available on some of the biggest machines. But not always. Sometimes you need multiple GPUs across multiple machines, because you need 32 of them, or 64 of them. I would not reach for this first, but you can do this as well. For the largest-scale training jobs, like the type that maybe the likes of Google and Facebook do, you might need that much hardware. The good news is it’s really no different to run this across multiple machines with multiple GPUs. You wrap it up in Horovod in the same way. You can equally well ask it to run the job across, in this example, eight machines with one GPU each. Now, on net, we’d expect that to be a little slower, because the GPUs are now physically separated across the network and not on the same machine. And indeed, you do sorta see that: this same training process takes about 17 minutes instead of 12 minutes. You don’t lose a lot, but you do lose something. So you’d only do this if you really needed more GPUs than one machine would allow. The benchmarks show that Horovod, scaled up in this way, gets you maybe 80 or 90 percent of linear scaling up to 64 or 128 GPUs. But I don’t think that particular limit matters so much, because for most problems, before you get to that large a scale, you’re probably gonna hit other bottlenecks. For example, as you crank up the number of GPUs, remember that your batch is getting subdivided into smaller and smaller sub-batches, and it gets harder and harder to keep those GPUs busy because they’re each doing little slices of work. You can turn the batch size up further, and you can turn the learning rate up further to try to take advantage of that, but you’ll hit some diminishing returns and some limits to how big the learning rate and the batch size can be.
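Here is a rough sketch of that Horovod wrapping and the HorovodRunner launch. The build_model() and make_petastorm_datasets() helpers, the step counts, and the learning rate are assumptions for illustration, and the np semantics shown are as I understand the Databricks API, so check them against your runtime.

```python
# A rough sketch of distributed training with Horovod's Keras binding, launched
# through Databricks' HorovodRunner.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

def train_hvd():
    hvd.init()

    # Pin each Horovod process to its own GPU
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = build_model()  # assumed helper: the Xception base + Dropout + Dense head

    # Scale the learning rate with the number of workers, and let Horovod's optimizer
    # wrapper do the efficient all-reduce of gradients between them
    optimizer = hvd.DistributedOptimizer(
        tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size()))
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # Broadcast initial weights from rank 0 so all workers start identically, and
    # average validation metrics across workers so early stopping sees the same
    # numbers everywhere; checkpointing-style callbacks would go on rank 0 only
    callbacks = [
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
        hvd.callbacks.MetricAverageCallback(),
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                         restore_best_weights=True),
    ]

    train_ds, val_ds = make_petastorm_datasets()  # assumed helper, as sketched earlier
    model.fit(train_ds,
              steps_per_epoch=1500 // hvd.size(),   # illustrative step counts
              validation_data=val_ds, validation_steps=150 // hvd.size(),
              epochs=30, callbacks=callbacks,
              verbose=2 if hvd.rank() == 0 else 0)

# On Databricks, HorovodRunner launches the processes as a barrier-mode Spark job;
# np=8 asks for 8 parallel worker processes (a negative np runs locally on the driver).
from sparkdl import HorovodRunner
hr = HorovodRunner(np=8)
hr.run(train_hvd)
```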
So for this particular, medium-sized problem, I found that it really didn’t make sense to use 16 or 32 GPUs. I just wasn’t able to take advantage of that much hardware, or turn my learning rate up enough, to make it worthwhile. That is likely to be the practical bottleneck, rather than the physical limits of how far the hardware itself can scale.

In closing, if you looked at GPU utilization in this case, you would find that indeed they’re all pretty well utilized.

GPU Utilization?

There’s still a little bit of spikiness; there’s still some inevitable overhead in IO, reading the data, and in waiting for the GPUs to communicate with each other. But it’s fairly minor at this point.

So thank you. I hope that’s been useful. I hope I’ve shared a couple of easy tips you can apply to your deep learning jobs, and that I’ve saved you a couple of hours or days of training time. And I hope that as you scale up your problems to implement some of these same magical use cases we see in the press, you have the tools available, in things like Petastorm, Horovod, TensorFlow, Keras, and PyTorch, to make it happen. We would welcome your feedback.

About Sean Owen

Databricks

Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author of Advanced Analytics with Spark. Previously, he was director of Data Science at Cloudera and an engineer at Google.