Thomas Graves

Principal Systems Software Engineer, NVIDIA

Thomas Graves is a distributed systems software engineer at NVIDIA, where he concentrates on accelerating Spark. He is a committer and PMC member on Apache Spark and Apache Hadoop. He previously worked at Yahoo on the Big Data Platform team, working on Apache Spark, Hadoop, YARN, Storm, and Kafka.

UPCOMING SESSIONS

End-to-end Deep Learning with Horovod on Apache Spark (Summit 2020)

Data processing and deep learning are often split into two pipelines: one for ETL processing and a second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training. The newly introduced Horovod Spark Estimator API enables TensorFlow and PyTorch models to be trained directly on Spark DataFrames, leveraging Horovod's ability to scale to hundreds of GPUs in parallel, without any specialized code for distributed training. With the new accelerator-aware scheduling and columnar processing APIs in Apache Spark 3.0, a production ETL job can hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline.
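For concreteness, a minimal sketch of what the Horovod Spark Estimator flow can look like, assuming a Keras model and Spark DataFrames train_df/test_df with "features" and "label" columns; the column names, store path, and hyperparameters here are illustrative assumptions, not values from the talk:

import tensorflow as tf
import horovod.spark.keras as hvd_keras
from horovod.spark.common.store import Store

# Store used to persist intermediate training data and checkpoints (hypothetical path).
store = Store.create('/tmp/horovod_store')

# Plain Keras model definition; nothing Horovod-specific here.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1)
])

# The Estimator wraps distributed training: Horovod launches num_proc
# training processes on the Spark cluster and feeds them the DataFrame.
keras_estimator = hvd_keras.KerasEstimator(
    num_proc=4,                      # number of parallel training processes
    store=store,
    model=model,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='mse',
    feature_cols=['features'],
    label_cols=['label'],
    batch_size=64,
    epochs=10)

# fit() returns a Spark ML Transformer usable for inference on DataFrames.
keras_model = keras_estimator.fit(train_df)
predictions = keras_model.transform(test_df)

Because fit() consumes a DataFrame and transform() produces one, the training step can sit directly after the ETL stages of the same Spark job.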

This breaks down the barriers between ETL and continuous model training. Operational and management overhead is lower, and data processing and cleansing is more directly connected to model training. This talk covers an end-to-end pipeline, demonstrating ETL and DL as separate pipelines, and Apache Spark 3.0 ETL with the Horovod Spark Estimator API to enable a single pipeline. We will demonstrate two pipelines: one using Databricks with Jupyter notebooks to run ETL and Horovod, and a second on YARN running a single application that transitions from ETL to DL using Horovod. The use of accelerators across both pipelines and the relevant Horovod features will also be discussed.
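A hedged sketch of the accelerator-aware scheduling piece in Apache Spark 3.0, assuming one GPU per executor; the discovery-script path and resource amounts are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark import TaskContext

spark = (SparkSession.builder
         .appName("gpu-etl-to-dl")
         # Request one GPU per executor from the cluster manager...
         .config("spark.executor.resource.gpu.amount", "1")
         # ...and schedule one task per GPU.
         .config("spark.task.resource.gpu.amount", "1")
         # Script that reports the GPU addresses available on each node (hypothetical path).
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/getGpusResources.sh")
         .getOrCreate())

def which_gpu(_):
    # Inside a task, the GPU addresses assigned to it are exposed via TaskContext.
    return TaskContext.get().resources()["gpu"].addresses

print(spark.sparkContext.parallelize(range(2), 2).map(which_gpu).collect())

This is the mechanism that lets the DL stage of the pipeline know which GPUs it owns, rather than having every process grab device 0.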

PAST SESSIONS

Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library - continues (Summit Europe 2019)

GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulation. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, which is now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as pandas, NumPy, scikit-learn, and XGBoost.

Through its use of Dask wrappers, the platform allows for true, large-scale computation with minimal, if any, code changes. The goal of this talk is to discuss RAPIDS, its functionality and architecture, and the way it integrates with Spark, often providing several orders of magnitude of acceleration versus its CPU-only counterparts.
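A minimal sketch of the drop-in idea with cuDF, the pandas-like DataFrame library in RAPIDS; the file and column names are illustrative assumptions:

import pandas as pd
import cudf

# CPU (pandas) version of a simple aggregation.
pdf = pd.read_csv("trips.csv")
cpu_result = pdf.groupby("passenger_count")["fare_amount"].mean()

# GPU (cuDF) version: essentially the same calls, but the data lives in GPU memory.
gdf = cudf.read_csv("trips.csv")
gpu_result = gdf.groupby("passenger_count")["fare_amount"].mean()

print(gpu_result.to_pandas())  # move the small result back to host memory

The point of the example is the API symmetry: moving the workload to the GPU changes the import, not the analysis code.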

Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library (Summit Europe 2019)

GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulation. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, which is now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as pandas, NumPy, scikit-learn, and XGBoost. Through its use of Dask wrappers, the platform allows for true, large-scale computation with minimal, if any, code changes.

The goal of this talk is to discuss RAPIDS, its functionality and architecture, and the way it integrates with Spark, often providing several orders of magnitude of acceleration versus its CPU-only counterparts.
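A hedged sketch of the Dask wrapper layer mentioned above, using dask_cudf with a dask-cuda local cluster to spread the same pandas-style calls across multiple GPUs; the cluster setup and file glob are illustrative assumptions:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One Dask worker per local GPU.
cluster = LocalCUDACluster()
client = Client(cluster)

# Lazily read many files into GPU-backed partitions.
ddf = dask_cudf.read_csv("trips_*.csv")

# Same groupby as the single-GPU case; compute() triggers the distributed execution.
result = ddf.groupby("passenger_count")["fare_amount"].mean().compute()
print(result)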