Introducing HorovodRunner for Distributed Deep Learning Training

Today, we are excited to introduce HorovodRunner in our Databricks Runtime 5.0 ML!

HorovodRunner provides a simple way to scale up your deep learning training workloads from a single machine to large clusters, reducing overall training time.

Many of our users want to train deep learning models on datasets that do not fit on a single machine, and to cut overall training time. HorovodRunner addresses both needs by distributing training across your cluster, processing more data per second and reducing training time from hours to minutes.

As part of an effort to integrate distributed deep learning with Apache Spark under Project Hydrogen, HorovodRunner uses barrier execution mode, introduced in Apache Spark 2.4. This execution model differs from the generic Spark model: it is tailored to distributed training, with different fault-tolerance semantics and support for communication between tasks on each worker node in the cluster.
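To make the difference concrete, here is a minimal sketch of the barrier API itself, using PySpark's BarrierTaskContext (HorovodRunner drives this for you, so you normally never write it directly; sc is assumed to be the notebook's predefined SparkContext):

```python
from pyspark import BarrierTaskContext

def run_training_task(iterator):
    context = BarrierTaskContext.get()
    # All tasks in a barrier stage are scheduled together and can
    # wait for one another, unlike regular Spark tasks.
    context.barrier()
    # getTaskInfos() lists the addresses of all tasks in the stage,
    # which a framework like Horovod can use to wire up MPI.
    addresses = [info.address for info in context.getTaskInfos()]
    yield (context.partitionId(), addresses)

# barrier() marks the stage for gang scheduling: all 4 tasks start
# at once, and a task failure retries the whole stage.
results = sc.parallelize(range(4), 4).barrier() \
            .mapPartitions(run_training_task).collect()
```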

In this blog, we describe HorovodRunner and how you can use HorovodRunner’s simple API to train your deep learning model in a distributed fashion, letting Apache Spark handle all the coordination and communication among tasks on each worker node in the cluster.

HorovodRunner’s Simple API

Horovod, Uber’s open source distributed training framework, supports TensorFlow, Keras, and PyTorch. HorovodRunner, built on top of Horovod, inherits support for these deep learning frameworks and makes them much easier to run at scale.

Under the hood, HorovodRunner shares code and libraries across machines, configures SSH, and executes the complicated MPI commands required for distributed training.

As a result, data scientists are freed from the burden of these operational requirements and can focus on the tasks at hand—building models, experimenting, and deploying them to production quickly.

Also, HorovodRunner provides a simple interface that lets you easily distribute your workloads across a cluster. For example, the snippet below runs the train function on 4 worker machines. This can help you achieve good scaling of your workloads, accelerate model experimentation, and shorten the time to production.
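A minimal sketch of that call, assuming a train function like the one shown in the next snippet (hr.run forwards keyword arguments such as learning_rate to it):

```python
from sparkdl import HorovodRunner

# np=4 launches the training function on 4 worker tasks,
# as a single barrier-mode Spark job.
hr = HorovodRunner(np=4)
hr.run(train, learning_rate=0.1)
```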

The train function below contains the Horovod training code. The sample code outlines the small changes needed to adapt a single-node workload to Horovod. With a few lines of code changed, and with HorovodRunner, you can start leveraging the power of a cluster in a matter of minutes.
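Here is a rough sketch of such a train function, assuming the horovod.tensorflow.keras API and a toy MNIST model (the model itself is a hypothetical stand-in for your existing single-node code); the Horovod-specific changes are just the init call, the optimizer wrapper, and the broadcast callback:

```python
def train(learning_rate=0.1):
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    # Initialize Horovod: one process per worker task.
    hvd.init()

    # A toy MNIST model; your existing single-node model works unchanged.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged across workers each step.
    optimizer = hvd.DistributedOptimizer(
        tf.keras.optimizers.SGD(lr=learning_rate * hvd.size()))

    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Broadcast initial weights from rank 0 so all workers start in sync.
    # (GPU pinning via hvd.local_rank() and rank-0-only checkpointing
    # are omitted here for brevity.)
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    model.fit(x_train, y_train,
              batch_size=64, epochs=2,
              callbacks=callbacks,
              verbose=2 if hvd.rank() == 0 else 0)
```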

Integrated Workflow on Databricks

HorovodRunner launches Horovod training jobs as Spark jobs, so your development workflow is exactly the same as for other Spark jobs on Databricks. For example, you can check training logs from the Spark UI, as shown in the animated Fig 1 below.

Fig 1. Checking training logs from the Spark UI (animated): https://www.youtube.com/watch?v=tlbK4ODANZU

Or you can just as easily trace an error back to a notebook cell and its code, as shown in the animated Fig 2 below.

Fig 2. Tracing an error back to a notebook cell (animated): https://www.youtube.com/watch?v=3uQPrfMKEog

Tools like TensorBoard and Horovod Timeline are also supported within Databricks.

Get started!

To get started, check out the example notebooks for classifying the MNIST dataset using TensorFlow, Keras, or PyTorch in Databricks Runtime 5.0 ML! To migrate your single-node workloads to a distributed setting, simply follow the steps outlined in the documentation.

Try Databricks today with Apache Spark 2.4 and Databricks Runtime 5.0.

Read More

Get more info on Horovod and HorovodEstimator
