Description
Horovod is a distributed deep learning training framework that supports TensorFlow, Keras, PyTorch, and Apache MXNet. It was originally developed by Uber to make distributed deep learning fast and easy to use.
How it Works
- Horovod allows a single-GPU training script to be scaled up to train across many GPUs in parallel.
- It follows the Message Passing Interface (MPI) programming model: each GPU runs its own copy of the training script, and the copies coordinate through collective operations.
- It employs efficient inter-GPU communication via the ring-allreduce algorithm, which averages gradients across workers without a central parameter server.
- Once a training script has been written to scale with Horovod, it can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes, as the sketch after this list shows.
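As a rough sketch, here is what those changes look like with the PyTorch binding; the toy model and learning rate are placeholders, and the same pattern applies to the TensorFlow, Keras, and MXNet bindings.

```python
import torch
import horovod.torch as hvd

hvd.init()                                 # start Horovod: one process per GPU
torch.cuda.set_device(hvd.local_rank())    # pin this process to its own GPU

model = torch.nn.Linear(10, 1).cuda()      # placeholder model

# Scale the learning rate by the worker count, since the effective batch
# size grows with the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every worker from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

The same script then runs unchanged on one GPU (`horovodrun -np 1 python train.py`), on four local GPUs (`horovodrun -np 4 python train.py`), or across hosts with a host list such as `horovodrun -np 8 -H server1:4,server2:4 python train.py`.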
Benefits
- Ease of use: Horovod requires minimal code changes to scale a single-GPU training script to multiple GPUs.
- Efficiency: with ring-allreduce, each worker's communication cost stays near-constant as workers are added, so Horovod achieves high scaling efficiency and makes distributed training faster.
- Versatility: Horovod supports multiple deep learning frameworks, including TensorFlow, Keras, PyTorch, and Apache MXNet.
Limitations
- Inter-GPU communication: efficient allreduce depends on a high-speed interconnect (e.g., NVLink within a host, InfiniBand between hosts), which may not be available in all environments.
- MPI dependency: Horovod’s reliance on the MPI model means users typically need an MPI installation and basic familiarity with MPI concepts such as size, rank, and local rank, illustrated in the sketch after this list.
- Scaling limitations: while Horovod can scale to many GPUs, communication overhead can bring diminishing returns as the number of GPUs grows.
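To make those MPI concepts concrete, here is a minimal sketch of how a script typically uses them; the dataset, batch size, and tensor shapes are illustrative placeholders.

```python
import torch
import horovod.torch as hvd
from torch.utils.data.distributed import DistributedSampler

hvd.init()

# The MPI-style concepts Horovod exposes:
#   size       - total number of workers across all hosts
#   rank       - this worker's global id (0 .. size-1)
#   local_rank - this worker's id within its own host (used to pick a GPU)
print(f"worker {hvd.rank()} of {hvd.size()}, local GPU slot {hvd.local_rank()}")

# Common use: give each worker a distinct shard of the dataset.
dataset = torch.utils.data.TensorDataset(torch.randn(100, 10))
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=16, sampler=sampler)

# Common use: restrict host-side work such as logging or checkpointing
# to a single worker.
if hvd.rank() == 0:
    print("rank 0 handles checkpoints and logs")
```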
Features
- Ring-allreduce: Horovod uses the ring-allreduce algorithm for efficient inter-GPU communication, simulated in the sketch after this list.
- Multi-framework support: Horovod supports TensorFlow, Keras, PyTorch, and Apache MXNet.
- MPI model: Horovod uses the MPI model, which is straightforward and requires fewer code changes than the parameter-server model used by classic distributed TensorFlow.
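To show what ring-allreduce actually does, below is a small single-process NumPy simulation of the algorithm (illustrative only, not Horovod code): each simulated worker exchanges one chunk per step with its right-hand neighbor, first summing chunks (reduce-scatter), then circulating the finished sums (allgather).

```python
import numpy as np

def ring_allreduce(buffers):
    """Average equally sized 1-D arrays, one per simulated worker, using
    only neighbor-to-neighbor chunk exchanges around a ring."""
    n = len(buffers)
    chunks = [np.array_split(b.astype(float), n) for b in buffers]

    # Reduce-scatter: in each of n-1 steps, worker i sends chunk (i - step) % n
    # to its right neighbor, which adds it to its own copy. Afterwards,
    # worker i holds the fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Allgather: in n-1 more steps, each completed chunk travels around the
    # ring, so every worker ends up holding every summed chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    # Horovod averages gradients, so divide the summed result by n.
    return [np.concatenate(ch) / n for ch in chunks]

workers = [np.arange(8.0) * (w + 1) for w in range(4)]  # 4 workers' "gradients"
reduced = ring_allreduce(workers)
assert all(np.allclose(r, sum(workers) / 4) for r in reduced)
```

Each worker sends roughly twice the buffer size in total no matter how many workers participate, which is what makes the algorithm bandwidth-efficient and avoids the central bottleneck of a parameter server.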
Use Cases