Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL

Download Slides

Data science workflows can benefit tremendously from being accelerated, to enable data scientists to explore more and larger datasets. This allows data scientist to drive towards their business goals, faster, and more reliably. Accelerating Apache Spark with GPU is the next step for data science. In this talk, we will share our work in accelerating Spark applications via CUDA and NCCL.

We have identified several bottleneck in Spark 2.4 in the areas of data serialization and data scalability. To address this we accelerated Spark based data analytics with enhancements to allow large columnar datasets to be analyzed directly in CUDA with Python. The GPU dataframe library, cuDF (, can be used to express advanced analytics easily. Through applying Apache Arrow and cuDF, we have achieved over 20x speedup over regular RDDs.

For distributed machine learning, Spark 2.4 introduced a barrier execution mode to support MPI allreduce style algorithms. We will demonstrate how the latest Nvidia NCCL library, NCCL2, could further scale out distributed learning algorithms, such as XGBoost.

Finally, an enhancement of Spark kubernetes scheduler will be introduced so that GPU resources can be scheduled from a kubernetes cluster for Spark applications. We will share our experience deploying Spark on Nvidia Tesla T4 server clusters. Based on the new NVIDIA Turing architecture, the T4, an energy-efficient 70-watt small PCIe form factor GPU, is optimized for scale-out computing environments and features multi-precision Turing Tensor Cores and new RT Cores.

Watch Richard Whitcomb and Rong Ou present Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL at 2019 Spark + AI Summit North America

« back
About Richard Whitcomb

Richard Whitcomb is a Senior Engineer at Nvidia, focusing on scalable machine learning platform. Prior to Nvidia, Richard was a Senior ML Engineer at Spotify working on Scala based Distributed ML Systems and a Staff Engineer at Twitter that integrated Spark with the Twitter stack and the Torch ML stack

About Rong Ou

Rong Ou is a Principal Engineer at Nvidia, working on machine learning and deep learning infrastructure. He introduced mpi-job support into Kubeflow for distributed training on Kubernetes. Prior to Nvidia, Rong was a Staff Engineer at Google.