Thomas Graves

Distributed Systems Software Engineer, NVIDIA

Thomas Graves is a distributed systems software engineer at NVIDIA, where he concentrates on accelerating Spark. He is a committer and PMC member on Apache Spark and Apache Hadoop. He previously worked at Yahoo on the Big Data Platform team, working on Apache Spark, Hadoop, YARN, Storm, and Kafka.

Past sessions

In this talk, I will dive into the stage level scheduling feature added in Apache Spark 3.1. Stage level scheduling builds on Project Hydrogen by improving big data ETL and AI integration, and it enables multiple other use cases as well. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPUs, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable deep learning and AI to use GPU resources efficiently. In this type of use case, users read from a distributed file system, manipulate and filter the data into the format the deep learning algorithm needs for training or inference, and then feed it into that algorithm. Combining stage level scheduling with accelerator-aware scheduling lets users move seamlessly from ETL to deep learning on the GPU by adjusting the container requirements for different stages within the same Spark application. This makes such applications easier to write and can improve hardware utilization and cost.
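
To make the mechanism concrete, here is a minimal PySpark sketch of requesting GPU containers for a training stage through a ResourceProfile; the resource amounts, discovery script path, and the prepared_df name are illustrative assumptions, not code from the talk:

    # Stage level scheduling, Spark 3.1+: build a profile that asks for
    # GPU executors, then attach it to the RDD feeding the DL stage.
    from pyspark.resource import (ResourceProfileBuilder,
                                  ExecutorResourceRequests,
                                  TaskResourceRequests)

    # Executor containers for the DL stage: 4 cores, 16g of memory, and
    # one GPU located via a discovery script (path is a placeholder).
    exec_reqs = (ExecutorResourceRequests().cores(4).memory("16g")
                 .resource("gpu", 1,
                           discoveryScript="/path/to/getGpusResources.sh"))
    task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    profile = (ResourceProfileBuilder()
               .require(exec_reqs).require(task_reqs).build)

    # Only the stages computing this RDD run in the new containers;
    # in Spark 3.1 the API is attached at the RDD level.
    train_rdd = prepared_df.rdd.withResources(profile)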

There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when the data is skewed or the data volume is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to go seamlessly from data preparation to training with the TensorFlow Keras API on GPUs; a rough sketch of that hand-off follows below.
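
As a rough illustration only (not the demo code itself), the hand-off can look like launching a Horovod training function from within the same Spark application once the GPU profile is in effect; train_fn's body and num_proc below are placeholders:

    # Launch Horovod workers on the Spark executors after ETL finishes.
    import horovod.spark

    def train_fn():
        # Runs once per Horovod worker process on the executors
        import horovod.tensorflow.keras as hvd
        hvd.init()
        # ...build a tf.keras model, scale the learning rate by
        # hvd.size(), wrap the optimizer in hvd.DistributedOptimizer(),
        # and call fit()...

    # One training process per slot; with a GPU resource profile
    # active, each process works against its assigned GPU.
    horovod.spark.run(train_fn, num_proc=4)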

The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster DataFrame access when operating on GPUs.
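
For reference, a minimal sketch of the pluggable caching hook: Spark 3.1 exposes it through the spark.sql.cache.serializer static configuration (SPARK-32274), and the NVIDIA class below, which ships with the RAPIDS Accelerator, is shown only as one example of a pluggable CachedBatchSerializer:

    from pyspark.sql import SparkSession

    # spark.sql.cache.serializer is a static conf, so it must be set
    # before the session starts; the class name is an example plugin.
    spark = (SparkSession.builder
             .config("spark.sql.cache.serializer",
                     "com.nvidia.spark.ParquetCachedBatchSerializer")
             .getOrCreate())

    df = spark.read.parquet("/data/events")  # path is a placeholder
    df.cache()  # cached batches now go through the pluggable serializer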


Summit 2020 End-to-End Deep Learning with Horovod on Apache Spark

June 23, 2020 05:00 PM PT

Data processing and deep learning are often split into two pipelines: one for ETL processing and one for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training. The newly introduced Horovod Spark Estimator API enables TensorFlow and PyTorch models to be trained directly on Spark DataFrames, leveraging Horovod's ability to scale to hundreds of GPUs in parallel, without any specialized code for distributed training. With the new accelerator-aware scheduling and columnar processing APIs in Apache Spark 3.0, a production ETL job can hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline.
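
For flavor, a minimal sketch of training through the Horovod Spark Estimator API; the model, column names, store path, process count, and DataFrame names are illustrative assumptions:

    import tensorflow as tf
    from horovod.spark.common.store import Store
    from horovod.spark.keras import KerasEstimator

    store = Store.create('/tmp/horovod')  # staging for data, checkpoints
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1, input_shape=(10,))])

    estimator = KerasEstimator(
        num_proc=4,                 # parallel training processes
        store=store,
        model=model,
        optimizer=tf.keras.optimizers.Adam(),
        loss='mse',
        feature_cols=['features'],
        label_cols=['label'],
        batch_size=64,
        epochs=5)

    # fit() trains directly on the Spark DataFrame and returns a
    # Spark ML Transformer usable for inference
    keras_model = estimator.fit(train_df)
    predictions = keras_model.transform(test_df)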

This breaks down the barriers between ETL and continuous model training. The operational and management burden is lower, and data processing and cleansing are more directly connected to model training. This talk covers an end-to-end pipeline, first demonstrating ETL and DL as separate pipelines and then Apache Spark 3.0 ETL with the Horovod Spark Estimator API as a single pipeline. We will demonstrate two pipelines: one using Databricks with Jupyter notebooks to run ETL and Horovod, and a second on YARN running a single application that transitions from ETL to DL using Horovod. The use of accelerators across both pipelines and Horovod features will also be discussed.

GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulations. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, which is now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as pandas, NumPy, scikit-learn, and XGBoost.

Through its use of Dask wrappers, the platform allows for true, large-scale computation with minimal, if any, code changes. The goal of this talk is to discuss RAPIDS: its functionality and architecture, as well as the way it integrates with Spark, in many cases providing several orders of magnitude of acceleration versus its CPU-only counterparts.
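
To illustrate the drop-in claim, a hedged sketch of the same aggregation on a single GPU with cuDF and across multiple GPUs with dask_cudf; the file names and columns are placeholders:

    import cudf
    import dask_cudf

    # Single GPU: pandas-like syntax, executed on the GPU
    gdf = cudf.read_csv('transactions.csv')
    totals = gdf.groupby('customer_id')['amount'].sum()

    # Multiple GPUs: the same operations, partitioned across
    # Dask workers, each backed by a GPU
    ddf = dask_cudf.read_csv('transactions-*.csv')
    totals_dist = ddf.groupby('customer_id')['amount'].sum().compute()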
