DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
- URL: http://arxiv.org/abs/2506.21263v1
- Date: Thu, 26 Jun 2025 13:45:04 GMT
- Title: DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
- Authors: Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu, Wu He, Xun Gao, Jason Zeng, Michael Heinrich,
- Abstract summary: We propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with a Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. We show that DiLoCoX can achieve a 357x speedup over vanilla AllReduce in distributed training while maintaining negligible degradation in model convergence.
- Score: 7.597885871452736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.
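As a rough illustration of how the one-step-delay overlap and gradient compression described in the abstract fit together, the sketch below runs DiLoCo-style local training on a toy quadratic problem: each round's compressed pseudo-gradient is treated as "in flight" while the next round's local steps run, and is only applied afterwards. The toy loss, the hyperparameters, the top-k compressor standing in for the adaptive scheme, and the plain SGD outer step (in place of the Dual Optimizer Policy) are all illustrative assumptions, not DiLoCoX's actual implementation.

```python
# Minimal sketch (not the authors' code) of one-step-delay overlap between
# communication and local training, with compressed pseudo-gradients.
import numpy as np

rng = np.random.default_rng(0)
dim, workers, rounds, local_steps = 32, 4, 10, 20        # assumed toy sizes
lr_inner, lr_outer, keep_ratio = 0.05, 0.7, 0.25         # assumed hyperparameters

targets = [rng.normal(size=dim) for _ in range(workers)]  # each worker's toy "data"

def local_grad(w, target):
    # Gradient of the toy loss 0.5 * ||w - target||^2.
    return w - target

def compress(v, ratio):
    # Top-k magnitude sparsification, a stand-in for the adaptive compression scheme.
    k = max(1, int(ratio * v.size))
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

global_w = rng.normal(size=dim)
in_flight = None   # previous round's pseudo-gradient, conceptually still being all-reduced

for r in range(rounds):
    start_w = global_w.copy()          # weights this round's local steps start from
    local_ws = []
    for target in targets:
        w = start_w.copy()
        for _ in range(local_steps):
            w -= lr_inner * local_grad(w, target)
        local_ws.append(w)

    # The previous round's pseudo-gradient finished communicating while the local
    # steps above were running; apply it now (plain SGD outer step for brevity).
    if in_flight is not None:
        global_w -= lr_outer * in_flight

    # Build this round's compressed pseudo-gradient and put it "in flight" so its
    # all-reduce overlaps the next round's local training.
    in_flight = np.mean([compress(start_w - w, keep_ratio) for w in local_ws], axis=0)
```

Because the outer update lags by one round, the all-reduce of the pseudo-gradient can proceed in parallel with the next round of local training, which is what removes the communication stall on slow (e.g., 1Gbps) links.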
Related papers
- NoLoCo: No-all-reduce Low Communication Training Method for Large Models [0.310688583550805]
Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators. NoLoCo implicitly synchronizes model weights via a novel variant of Nesterov momentum, partially averaging each worker's weights with those of a randomly selected peer (see the sketch after this list). Our method requires significantly less communication overhead than fully sharded data parallel training, or even the widely used low-communication training method DiLoCo.
arXiv Detail & Related papers (2025-06-12T17:23:23Z)
- Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism [59.79227116582264]
Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation.
arXiv Detail & Related papers (2025-06-02T02:19:22Z)
- Decentralized Diffusion Models [53.89995588977048]
Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support. We propose Decentralized Diffusion Models, a scalable framework for distributing diffusion model training across independent clusters.
arXiv Detail & Related papers (2025-01-09T18:59:56Z)
- DeMo: Decoupled Momentum Optimization [6.169574689318864]
Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. We introduce Decoupled Momentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW.
arXiv Detail & Related papers (2024-11-29T17:31:47Z)
- Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z)
- Simplifying Distributed Neural Network Training on Massive Graphs: Randomized Partitions Improve Model Aggregation [23.018715954992352]
We present a simplified framework for distributed GNN training that does not rely on the aforementioned costly operations.
Specifically, our framework assembles independent trainers, each of which asynchronously learns a local model on locally-available parts of the training graph.
In experiments on social and e-commerce networks with up to 1.3 billion edges, our proposed RandomTMA and SuperTMA approaches achieve state-of-the-art performance and 2.31x speedup compared to the fastest baseline.
arXiv Detail & Related papers (2023-05-17T01:49:44Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model with better performance than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
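For a concrete picture of the no-all-reduce style of synchronization mentioned in the NoLoCo entry above, here is a loose sketch of gossip-style partial weight averaging: each worker trains locally with Nesterov momentum and occasionally blends its weights with one randomly chosen peer, so no collective communication is required. The mixing coefficient, pairing schedule, and toy objective are illustrative assumptions; NoLoCo itself folds the averaging into its Nesterov momentum variant rather than applying it as a separate step.

```python
# Loose sketch (assumptions noted above) of partial weight averaging with a random peer.
import numpy as np

rng = np.random.default_rng(1)
dim, workers, steps, mix_every = 16, 8, 200, 5
lr, momentum, alpha = 0.05, 0.9, 0.5               # assumed hyperparameters

targets = [rng.normal(size=dim) for _ in range(workers)]  # each worker's toy "data"
ws = [rng.normal(size=dim) for _ in range(workers)]
vs = [np.zeros(dim) for _ in range(workers)]

for step in range(steps):
    for i in range(workers):
        # Nesterov momentum step on a toy quadratic loss 0.5 * ||w - target||^2.
        lookahead = ws[i] + momentum * vs[i]
        g = lookahead - targets[i]
        vs[i] = momentum * vs[i] - lr * g
        ws[i] = ws[i] + vs[i]
    if step % mix_every == 0:
        # Each worker partially averages with one randomly drawn peer
        # (point-to-point traffic only; no all-reduce, so stragglers matter less).
        perm = rng.permutation(workers)
        new_ws = [w.copy() for w in ws]
        for i in range(workers):
            j = int(perm[i])
            if j != i:
                new_ws[i] = (1 - alpha) * ws[i] + alpha * ws[j]
        ws = new_ws
```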