Overview
Horovod is a distributed deep learning training framework originally developed by Uber and now hosted by the LF AI & Data Foundation. It supports PyTorch, TensorFlow, Keras, and Apache MXNet, and lets users scale model training across multiple GPUs and machines. The goal is to cut training time from days or weeks to hours or minutes. Existing single-GPU training scripts can be scaled with minimal code changes, typically a few lines of Python. Horovod is designed to be portable: it runs on-premise, in the cloud (AWS, Azure, Databricks), and on Apache Spark. Running on Spark makes it possible to unify data processing and model training in a single pipeline. Because it supports multiple frameworks, Horovod stays usable as a team's machine learning stack evolves. It targets data scientists and machine learning engineers who want to accelerate and scale their deep learning workflows.
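To give a sense of what "a few lines of Python" can look like, here is a minimal sketch for PyTorch. It assumes Horovod was installed with PyTorch support. The helper names `scaled_learning_rate` and `make_distributed`, the SGD optimizer, and the default learning rate are illustrative choices, not part of Horovod. Scaling the learning rate by the number of workers is a common convention rather than a requirement.

```python
def scaled_learning_rate(base_lr: float, world_size: int) -> float:
    """Linear scaling: multiply the single-worker learning rate by the worker count."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return base_lr * world_size


def make_distributed(model, base_lr: float = 0.01):
    """Hypothetical helper: apply the usual Horovod additions to a PyTorch model.

    Returns the model and a Horovod-wrapped optimizer.
    """
    # Imported here so this module still loads where Horovod is not installed.
    import torch
    import horovod.torch as hvd

    hvd.init()  # initialize Horovod; typically one process per GPU

    if torch.cuda.is_available():
        # Pin this process to the GPU matching its rank on the local machine.
        torch.cuda.set_device(hvd.local_rank())
        model.cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=scaled_learning_rate(base_lr, hvd.size()),
    )

    # Wrap the optimizer so gradients are averaged across all workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Start every worker from the same weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    return model, optimizer
```

A script that calls this helper would usually be launched with Horovod's launcher, for example `horovodrun -np 4 python train.py` to start four worker processes (`train.py` is a placeholder name). The rest of the training loop stays the same as in the single-GPU version.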
Common tasks