Bobby Wang is a software engineer working on GPU acceleration for Spark applications. He holds an MS in Communication Engineering from the University of Electronic Science and Technology of China. Before moving to Spark-related work, he spent several years working on Android apps and the Android framework at Qualcomm and Nvidia.
XGBoost is one of the most popular machine learning libraries, and its Spark integration enables distributed training on a cluster of servers. At Spark+AI Summit 2019, we shared GPU acceleration of Spark XGBoost for classification and regression model training on Spark 2.x clusters. This talk will cover recent progress on XGBoost and its GPU acceleration, presented via Jupyter notebooks on Databricks. Spark XGBoost has been enhanced to train on large datasets with GPUs. Training data can now be loaded in chunks, and the XGBoost DMatrix is built up incrementally with compression. The compressed DMatrix data can be stored in GPU memory or in external memory/disk. These changes enable us to train models on datasets that exceed the GPU memory limit. A gradient-based sampling algorithm with external memory has also been introduced to achieve comparable accuracy and improved training performance on GPUs. XGBoost has also recently added new GPU kernels for learning-to-rank (LTR) tasks. They provide several algorithms: pairwise rank, and lambda rank with NDCG or MAP. These GPU kernels enable a 5x speedup on LTR model training with the largest public LTR dataset (MSLR-Web). We have integrated Spark XGBoost with the RAPIDS cuDF library to achieve end-to-end GPU acceleration on Spark 2.x and Spark 3.0. We achieved a significant end-to-end speedup when training on GPUs compared to CPUs. Accelerated XGBoost turns hours of training into minutes at a relatively lower cost. We will share our latest benchmark results with large datasets, including the publicly available 1TB Criteo click logs.
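To make the gradient-based sampling idea concrete, here is a minimal pure-Python sketch. It is not the XGBoost GPU implementation. It assumes each row is kept with probability proportional to a regularized gradient magnitude, sqrt(g^2 + lambda * h^2), and that kept rows are reweighted by the inverse of that probability. The function name `gradient_based_sample` and its parameters are hypothetical, chosen only for illustration.

```python
import math
import random


def gradient_based_sample(grads, hessians, sample_rate, reg_lambda=1.0, seed=0):
    """Return {row_index: weight} for the rows kept in one boosting round.

    Simplified illustration: probabilities are proportional to the
    regularized gradient magnitude and capped at 1.
    """
    rng = random.Random(seed)
    # Regularized gradient magnitude for each row.
    scores = [math.sqrt(g * g + reg_lambda * h * h)
              for g, h in zip(grads, hessians)]
    total = sum(scores)
    # Target number of rows to keep (an upper bound on the expected
    # count once probabilities are capped at 1).
    target = sample_rate * len(scores)
    kept = {}
    for i, s in enumerate(scores):
        # Rows with larger gradients are more likely to be selected.
        p = min(1.0, target * s / total) if total > 0 else sample_rate
        if p > 0 and rng.random() < p:
            # Inverse-probability weight keeps weighted gradient sums unbiased.
            kept[i] = 1.0 / p
    return kept


if __name__ == "__main__":
    grads = [0.01, -0.02, 1.5, -2.0, 0.03, 0.9]
    hessians = [0.25] * len(grads)
    print(gradient_based_sample(grads, hessians, sample_rate=0.5))
```

Because rows with small gradients contribute little to the split statistics, a sampler of this kind can drop most of them each round, which is what lets a much smaller working set fit in GPU memory while the rest of the data stays in external memory.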