Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters


XGBoost is one of the most popular machine learning libraries, and its Spark integration enables distributed training on a cluster of servers. At Spark + AI Summit 2019, we shared GPU acceleration of Spark XGBoost for classification and regression model training on Spark 2.x clusters. This talk covers the recent progress on XGBoost and its GPU acceleration, demonstrated via Jupyter notebooks on Databricks. Spark XGBoost has been enhanced to train large datasets with GPUs. Training data can now be loaded in chunks, and the XGBoost DMatrix is built up incrementally with compression. The compressed DMatrix data can be stored in GPU memory or in external memory/disk. These changes enable us to train models on datasets beyond the GPU memory limit. A gradient-based sampling algorithm with external memory has also been introduced to achieve comparable accuracy and improved training performance on GPUs. XGBoost has recently added a new GPU kernel for learning-to-rank (LTR) tasks. It provides several algorithms: pairwise rank, and LambdaRank with NDCG or MAP. These GPU kernels enable a 5x speedup on LTR model training with the largest public LTR dataset (MSLR-Web). We have integrated Spark XGBoost with the RAPIDS cuDF library to achieve end-to-end GPU acceleration on Spark 2.x and Spark 3.0. We achieved a significant end-to-end speedup when training on GPUs compared to CPUs: accelerated XGBoost turns hours of training into minutes at a relatively lower cost. We will share our latest benchmark results with large datasets, including the publicly available 1TB Criteo click logs.


 


Video Transcript

– Hey, hello everyone. My name is Rong, I’m with my colleague Bobby here, we’re both from NVIDIA. Today, we’re going to talk to you about running XGBoost Training on Spark with GPU acceleration.

Scalable Acceleration of XGBoost Training on Spark GPU Clusters

So we basically have two parts to this talk. In the first part, I'm going to tell you a little bit about XGBoost itself and then do a couple of deep-dive topics: one on gradient-based sampling, the other on learning to rank. Then I'll hand off to Bobby, who will talk about running XGBoost training on different versions of Spark, Spark 2.x and Spark 3.0.

Okay, so the first bit, a little bit about XGBoost.

You know, if we were sitting face-to-face at this point, I would probably ask you if you've used XGBoost before. But hopefully, if you're watching this talk, you've at least heard of XGBoost. It's an open-source library you can use to build gradient-boosted decision trees. It's a pretty popular library, used in many technical competitions and data science challenges, and quite a few companies use it in production for real products. It supports different kinds of machine learning problems, from regression to classification to learning to rank, which we'll talk about in more detail, and you can even define your own custom objectives, so you can define your own problems and solve them that way.

Distributed XGBoost

Since this is a Spark + AI Summit, a lot of you are probably interested in scaling up your machine learning problems on a large cluster.

XGBoost itself has pretty good support for distributed training. You can use it in different cloud environments or on Hadoop YARN clusters.

Of course, the best environment to run distributed training is Spark, but it also supports running on other dataflow systems like Flink.

XGBoost GPU Support

So scaling out to a cluster is one way of training on a large dataset, but sometimes it can still be slow. That's where GPU acceleration comes in: with the same size of cluster, if you add in GPUs, you can potentially do the training a lot faster. XGBoost has actually had GPU support for a while now, and it's pretty easy to use. If you're training on CPU, you typically use the hist algorithm, which builds histograms to build trees. The GPU equivalent is called gpu_hist. So basically, you just swap the tree method from hist to gpu_hist, and off you go; you can see the speedup when training on GPUs (a quick sketch follows below). Okay, so now I'm going to dive a little bit deeper into the first of the two topics: gradient-based sampling.
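Here is that sketch, using the native XGBoost Python API on placeholder data; the shapes, labels, and round counts are illustrative, not from the talk:

```python
import numpy as np
import xgboost as xgb

# Placeholder training data standing in for a real dataset.
X = np.random.rand(10_000, 50)
y = np.random.randint(0, 2, size=10_000)
dtrain = xgb.DMatrix(X, label=y)

# CPU training: histogram-based tree construction.
cpu_params = {"objective": "binary:logistic", "tree_method": "hist"}
cpu_model = xgb.train(cpu_params, dtrain, num_boost_round=100)

# GPU training: the only change is the tree method.
gpu_params = {"objective": "binary:logistic", "tree_method": "gpu_hist"}
gpu_model = xgb.train(gpu_params, dtrain, num_boost_round=100)
```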

Out-of-core Boosting

When you’re training XGBoost models, even for animation learning programs, typically, you want to fit your whole training data set into the memory. But if you’re doing this on the GPU, GPU memory is typically smaller than the main system memory on the cloud environment, in the typical deployment. I think the current generation of GPUs have 16GB of RAM, but if you’re speeding up a virtual instance, even with just one GPU, you can typically get 60-80GB of system RAM. So you’re at a disadvantage if you want to fit your whole dat set into GPU memory.

So even when you’re speeding up a good sized Spark cluster, if you want to fit everything into GPU, it might still not be possible.

Naively, what you can do is say: I'll put all my data in system memory or on disk, and stream it in during training. But the problem is that you have to copy your data over the PCIe bus, which is pretty fast in the current generation, but still much slower than direct memory access, especially compared to GPU memory, which has much higher bandwidth and lower latency than the PCIe bus.

So, to work around this problem, people have tried many different things. One approach is to sample your data when you're constructing a tree. At the beginning of each tree's construction, you sample your training data and keep only the sampled data in memory. That way, you reduce the amount of data you have to keep in memory.

And then, as you’re constructing the tree, you should note you can use the sample data instead of the full data set. The problem is, if you’re just doing this with the most straight forward approach, just do random sampling or uniform sampling, typically it would require at least half the data that you have to keep in the sample to get reasonable accuracy. So you’re not really getting much, maybe you can double the size of your training data, but if you have 10, 20TB of training data it still requires a lot of GPUs and a very large cluster.

Gradient-based Sampling

So to improve on that, one thing we can try is to use gradients as a measure for the sampling. As you're constructing this series of trees in XGBoost, once you finish the previous tree, you can run your training data through the current model to get the output from your objective function and calculate the gradients, which are just the first-order and second-order derivatives.

Then, based on the gradients, you can say: I'm going to sample the data with probabilities proportional to some measure of the gradient. People have tried different approaches; one is called Gradient-based One-Side Sampling (GOSS), which is used in another library called LightGBM.

More recently, there's a new algorithm called Minimum Variance Sampling (MVS), which was implemented in CatBoost.

For this work, we implemented MVS inside XGBoost on GPUs. With these types of approaches, you can reduce the size of your sample and still get reasonable accuracy. You can go down, maybe as low as just 10% of your data, and still get a reasonable model.
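In recent XGBoost releases this is exposed as the gradient_based sampling method, which pairs with the gpu_hist tree method. A minimal sketch on placeholder data, assuming an XGBoost build that supports the sampling_method parameter:

```python
import numpy as np
import xgboost as xgb

# Placeholder data; in practice this would be a dataset too large to keep on the GPU.
X = np.random.rand(100_000, 50)
y = np.random.rand(100_000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "tree_method": "gpu_hist",            # gradient-based sampling needs the GPU hist builder
    "sampling_method": "gradient_based",  # sample rows by gradient magnitude instead of uniformly
    "subsample": 0.1,                     # keep roughly 10% of rows per tree
}
model = xgb.train(params, dtrain, num_boost_round=100)
```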

Maximum Data Size

So once we implemented this algorithm, we tested it on a synthetic dataset we generated with 500 columns; the data is generated randomly. We tried it on an NVIDIA V100 GPU, which has 16GB of memory. If you keep everything in memory, for this particular dataset you can fit about 9 million rows on a single GPU. When we enable the out-of-core mode, even if you keep everything in memory, because we build the transformed data incrementally we can still fit slightly more rows, but not too many more. Once you start sampling, for example if you just sample 10% of the data, you can get 85 million rows on a single GPU. That's roughly an order of magnitude more.

Training Time

In terms of training time, whether you keep everything in-core or use the out-of-core approach, CPU training takes quite some time compared to the GPU approach. With the GPU, interestingly, even if we do just the out-of-core approach without sampling, it's actually slightly faster, because we only do the incremental data transformation at the beginning of training and then keep everything in memory after that. So it's slightly faster than doing the transformation on the whole dataset at once. Once we start sampling, because there's some cost associated with sampling at the beginning of each tree, training is slower, but in terms of accuracy it's pretty much equivalent, and for some combinations of hyperparameters we actually see some improvement in accuracy.

Model Accuracy

Here we show the training curves of the out-of-core algorithm with different sampling ratios. You can see they're pretty much equivalent; you see only a slight drop when you sample very aggressively, going all the way down to 10%.

Okay, switching gears here. I'm going to talk about the next topic, which is learning to rank.

Learning to Rank (LTR) in a Nutshell

So, what is learning to rank? I think a good example to think about is search, you go to Google and type in Spark.

I tried this last night; when you type in Spark there are 500 million results. So the problem is, how do you determine which results to show to the user for this query?

Google famously has this page rank algorithm.

It's actually not a machine learning algorithm, it's just another graph algorithm: based on the incoming links, it tries to figure out which pages are more important. But you can also do a similar ranking with machine learning, using XGBoost. Once you have the pages that are related to a query, you can group them based on the relevance of domains, sub-domains, et cetera. Within each group, we can use machine learning to determine the ranking.

LTR in XGBoost

With XGBoost, basically what you want is a supervised training dataset, so you know the relative ranking between any two URLs.

Once you have that, then you can iteratively sample these pairs and minimize the ranking error between any pair.

LTR Algorithms

So there are three different algorithms supported on the GPU. The first one, the default, just does a pairwise comparison. Then there are two more: mean Average Precision (mAP) and normalized Discounted Cumulative Gain (nDCG). They're variations on the pairwise approach that add further weighting of the instance pairs.

Enable and Measure Model Performance

To train a learning-to-rank model on the GPU, first you have to enable the GPU histogram tree method. The objective function needs to be one of the ranking objectives, and when you're evaluating the model you also need a metric that supports ranking. There are a bunch of different metrics you can use to track what you're trying to optimize.
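A minimal sketch of those settings with the native XGBoost Python API; the query groups, labels, and metric cutoff below are illustrative placeholders:

```python
import numpy as np
import xgboost as xgb

# Placeholder query-grouped data: three queries with 4, 3, and 5 candidate documents.
X = np.random.rand(12, 20)
y = np.random.randint(0, 5, size=12)   # graded relevance labels
group_sizes = [4, 3, 5]                # documents per query, in row order

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(group_sizes)          # tell XGBoost where each query's documents start and end

params = {
    "tree_method": "gpu_hist",   # GPU histogram method
    "objective": "rank:ndcg",    # or "rank:pairwise" / "rank:map"
    "eval_metric": "ndcg@10",    # ranking-aware evaluation metric
}
ranker = xgb.train(params, dtrain, num_boost_round=100)
```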

Performance – Environment and Configuration

For the performance evaluation, we used the Microsoft MSLR-Web benchmark dataset. I won't go into too much detail here; you can read all the configuration details on the slides.

Performance – Numbers

So for the final results, once we enable everything on the GPU, you can see the results for the three different algorithms: the GPU is a lot faster, 10x to 20x, even close to 30x, compared to the equivalent CPU approach.

Now, I'm going to hand off to Bobby, who'll talk about running XGBoost on the different Spark versions. – Thanks, Rong, for the introduction to the XGBoost internals. From the first half of the talk, we can see a pretty big speedup on learning to rank by leveraging GPUs, actually a 20x speedup. That's quite amazing. But what I would like to add is that we can get a good speedup not only for learning to rank, but also for the classification and regression provided by XGBoost, by leveraging GPUs.

So how do we use XGBoost to train on existing data? The first thing we need to do is convert the existing data to numeric data, which is the only format XGBoost accepts.

In reality, our existing data may not be numeric, which means we need some tools to do the ETL. The question is, which tool?

You've probably guessed it: Spark. I assume you already know how popular and powerful Spark is, so I won't talk much about why Spark.
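As one common way to do that numeric conversion in PySpark, here is a small sketch with made-up column names: numeric strings are cast to doubles and a categorical column is index-encoded.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

# Illustrative data: one categorical column, one numeric-looking string column.
df = spark.createDataFrame(
    [("mobile", "19.9", 0), ("desktop", "3.5", 1), ("tablet", "7.0", 0)],
    ["device", "price", "label"],
)

# Cast numbers stored as strings to a numeric type.
df = df.withColumn("price", col("price").cast("double"))

# Encode the categorical column as a numeric index, then drop the original.
indexer = StringIndexer(inputCol="device", outputCol="device_idx")
df = indexer.fit(df).transform(df).drop("device")

df.show()
```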

XGBoost4J-Spark

Fortunately, the XGBoost community supports XGBoost on Spark through the XGBoost4J-Spark project. The XGBoost4J-Spark project aims to integrate XGBoost and Spark by fitting XGBoost into Spark's machine learning library framework. With this integration, it not only uses the high performance of XGBoost but also leverages Spark's powerful data processing engine. So you can benefit a lot from the XGBoost4J-Spark project.

XGBoost + Spark 2.x + Rapids

Looking at the whole machine learning workflow, if you work on machine learning, you probably know that training models and doing predictions are a pretty small portion of your work. Most of your time is spent on data preparation, just like the picture shows. On the left, you have the traditional CPU-powered workflow. Data engineers start their morning by writing some code to tell the data engine to do feature engineering.

They start it running on a big dataset, then go and get some coffee. After a while, they realize, "Oh, I forgot to add a feature, so I need to change my code a little bit, restart it, and wait."

After another while, there are unexpected null values and some missing data somewhere, so they need to change the code and restart it again. So it's not a "fire and forget" process; it's an iterative process. All of this together is pretty slow, because you are not leveraging GPUs. So how do we solve this problem? How do we accelerate ETL with GPUs, so we can have a faster machine learning workflow? RAPIDS cuDF is the solution. RAPIDS cuDF has two components: libcudf and the language bindings. libcudf is a C++ library which provides low-level APIs to accelerate operations like sorting, filtering, joining, aggregation and so on.

And on top of that, there are language bindings for different languages, such as Java and Python.
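To give a feel for those bindings, here is a small sketch using the cuDF Python API; the file name and column names are placeholders:

```python
import cudf

# Load a CSV straight into GPU memory (file and column names are placeholders).
gdf = cudf.read_csv("transactions.csv")

# Typical ETL operations, all executed on the GPU by libcudf.
gdf = gdf[gdf["amount"] > 0]                                  # filter
gdf = gdf.sort_values("timestamp")                            # sort
summary = gdf.groupby("customer_id").agg({"amount": "sum"})   # aggregate
print(summary.head())
```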

So, with RAPIDS cuDF, we came up with a solution to accelerate XGBoost end-to-end on Spark 2.x. First, we load the data into GPU memory. Second, GPU memory is pretty precious, so we need to pack in as much data as possible; chunked loading achieves this goal. Third, we convert the cuDF columns to a sparse DMatrix. Finally, we feed the DMatrix to XGBoost to train and get the model.
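Stripping away the Spark distribution and chunked loading, the per-worker flow looks roughly like this single-GPU sketch using the cuDF and XGBoost Python APIs (paths and column names are placeholders, and it assumes an XGBoost build with GPU support that provides DeviceQuantileDMatrix; the actual integration in the talk goes through NVIDIA's Spark XGBoost packages):

```python
import cudf
import xgboost as xgb

# Step 1: load the training data straight into GPU memory (path is a placeholder).
gdf = cudf.read_csv("train.csv")
label = gdf["label"]
features = gdf.drop(columns=["label"])

# Steps 2-3: build a quantized DMatrix directly from device-resident columns,
# avoiding a round trip through host memory.
dtrain = xgb.DeviceQuantileDMatrix(features, label=label)

# Step 4: train on the GPU.
params = {"objective": "binary:logistic", "tree_method": "gpu_hist"}
model = xgb.train(params, dtrain, num_boost_round=100)
```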

Training on GPUs with Spark 2.x

This slide demonstrates how to train XGBoost on both CPU and GPU. The code on the left is for the CPU: first we load the data into a DataFrame, then we vectorize the features, and finally we fit the DataFrame to the XGBoost classifier.

In the right column, we use a customized data reader to load the data into a GPU dataset, and we skip the feature vectorization, because we've already loaded the data into GPU memory in columnar format.

Instead, we specify the feature column names when building the DMatrix. And finally, we fit the GPU dataset to the XGBoost classifier or regressor.
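The slides show this in the Scala XGBoost4J-Spark API. As a rough PySpark analogue of the CPU-side flow, assuming a recent xgboost release that ships a PySpark estimator (paths and column names are placeholders), it looks like the sketch below; the GPU path in the talk swaps in NVIDIA's GPU data reader and drops the VectorAssembler step, and its exact API isn't reproduced here.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # PySpark estimator shipped with recent xgboost releases

spark = SparkSession.builder.getOrCreate()

# Load the prepared numeric data (path and column names are placeholders).
df = spark.read.parquet("/data/train.parquet")
feature_cols = [c for c in df.columns if c != "label"]

# CPU flow: assemble the feature columns into a single vector column...
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(df)

# ...then fit the distributed XGBoost classifier.
clf = SparkXGBClassifier(features_col="features", label_col="label", num_workers=4)
model = clf.fit(train_df)
```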

XGBoost + Spark 2.x + Rapids

We did some benchmarking with mortgage data totalling 190GB. You can see from the chart (shorter is better) that we get a 34x speedup in training time and a 6x cost saving. Those are quite significant numbers.

Before talking about XGBoost and Spark 3.0: if you remember, we had a customized GPU data reader and a GPU dataset to bring cuDF in and accelerate XGBoost on Spark 2.x. We could not directly use Spark's DataFrame because of the poor GPU support in Spark 2.x. But Spark 3.0 is coming with GPU scheduling support and columnar processing on GPUs, so it is possible for Spark 3.0 itself to be accelerated by GPUs.

Actually, we did this in the RAPIDS plugin project. The RAPIDS plugin is a Spark plugin that leverages GPUs to accelerate processing through the RAPIDS libraries, for example cuDF and RMM.

Seamless Integration with Spark 3.0

So, with the RAPIDS plugin on Spark 3.0, there are no code changes required; everything happens behind the scenes. In the initial release, the RAPIDS plugin can be used to accelerate Spark DataFrames, Spark SQL, and machine learning/deep learning training frameworks.
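Enabling the plugin is mostly a configuration change; a hedged sketch of what that can look like from PySpark is below. The config keys are the plugin's standard ones, but the exact jars and settings depend on your Spark and CUDA versions, so check the RAPIDS Accelerator documentation.

```python
from pyspark.sql import SparkSession

# Enabling the RAPIDS Accelerator is a configuration change rather than a code rewrite.
# The plugin jars must be on the classpath; versions depend on your Spark and CUDA setup.
spark = (
    SparkSession.builder
    .appName("gpu-etl")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # register the RAPIDS SQL plugin
    .config("spark.rapids.sql.enabled", "true")             # turn on GPU SQL/DataFrame acceleration
    .getOrCreate()
)

# DataFrame and SQL code below runs unchanged; supported operators execute on the GPU.
```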

Rapids Plugin

We've been working to make this happen. There are actually two separate plugins we provide: the first is the computational ETL plugin, which works with the SQL and DataFrame APIs, and the other is the shuffle plugin. But I don't want to talk too much about the RAPIDS plugin here, because there's another session that covers it, which I strongly recommend you watch.

XGBoost + Spark 3.0 + Rapids

So with the RAPIDS plugin, we can accelerate XGBoost end-to-end, with GPU scheduling and a GPU-accelerated data reader with chunked loading, and, what's more, the other operators can also be accelerated by GPUs, for example filter, sort, join, and so on.

Training on GPUs with Spark 3.0

From the code you can see that we use Spark's built-in DataFrame reader to load the data onto the GPU. We also skip the feature vectorization, because the Spark reader has loaded the data into GPU memory in columnar format.

We also specify the feature column names to the XGBoost classifier or regressor. And finally, we fit the DataFrame to the XGBoost classifier.
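Again, the slides use the Scala API. A rough PySpark-flavoured sketch of the same idea, assuming a recent xgboost PySpark estimator that accepts a list of feature column names and a GPU flag (paths and columns are placeholders):

```python
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.getOrCreate()

# Spark 3.x built-in reader; with the RAPIDS plugin enabled, the scan and any ETL
# on this DataFrame can run on the GPU (path and column names are placeholders).
df = spark.read.parquet("/data/train.parquet")
feature_cols = [c for c in df.columns if c != "label"]

# No VectorAssembler step: pass the feature column names directly and train on GPUs.
clf = SparkXGBClassifier(
    features_col=feature_cols,  # a list of columns, assuming the estimator accepts one
    label_col="label",
    num_workers=4,
    use_gpu=True,               # or device="cuda" on newer releases
)
model = clf.fit(df)
```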

XGBoost + Spark 3 + Rapids

We also ran a benchmark with the 1TB Criteo dataset.

We see a 5.5x speedup in training time and a 6.4x cost saving.

New eBook: Accelerating Spark 3

We're also providing a new eBook, Accelerating Spark 3. In this book, you'll learn about the evolution of data processing from Hadoop to GPUs and the NVIDIA RAPIDS library, what Spark is and why it matters, GPU acceleration of DataFrames and Spark SQL, and examples of end-to-end machine learning workflows GPU-accelerated with XGBoost.

And here are some links for XGBoost. So I think this is the last thing we have.


 
About Rong Ou

NVIDIA

Rong Ou is a Principal Engineer at Nvidia, working on machine learning and deep learning infrastructure. He introduced mpi-job support into Kubeflow for distributed training on Kubernetes. Prior to Nvidia, Rong was a Staff Engineer at Google.

About Bobby Wang

NVIDIA

Bobby Wang is a software engineer working on GPU acceleration for Spark applications. He holds an MS in Communication Engineering from the University of Electronic Science and Technology of China. Prior to his Spark-related work, he worked on Android apps and frameworks for years at Qualcomm and Nvidia.