Travis Addair is a software engineer at Uber and technical lead for the Deep Learning Training team as part of the Michelangelo AI platform. In the open source community, he serves as lead maintainer for the Horovod distributed deep learning framework and is a co-maintainer of the Ludwig auto ML framework. In the past, he’s worked on scaling machine learning systems at Google and Lawrence Livermore National Lab.
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training. The newly introduced Horovod Spark Estimator API enables TensorFlow and PyTorch models to be trained directly on Spark DataFrames, leveraging Horovod's ability to scale to hundreds of GPUs in parallel, without any specialized code for distributed training. With the new accelerator aware scheduling and columnar processing APIs in Apache Spark 3.0, a production ETL job can hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline.
This breaks down the barriers between ETL and continuous model training. Operational and management tasks are lower, and data processing and cleansing is more directly connected to model training. This talk covers an end to end pipeline, demonstrating ETL and DL as separate pipelines, and Apache Spark 3.0 ETL with the Horovod Spark Estimator API to enable a single pipeline. We will demonstrate 2 pipelines - one using Databricks with Jupyter notebooks to run ETL and Horovod, the second on YARN to run a single application to transition from ETL to DL using Horovod. The use of accelerators across both pipelines and Horovod features will be discussed.