NoLoCo: No-all-reduce Low Communication Training Method for Large Models
- URL: http://arxiv.org/abs/2506.10911v1
- Date: Thu, 12 Jun 2025 17:23:23 GMT
- Title: NoLoCo: No-all-reduce Low Communication Training Method for Large Models
- Authors: Jari Kolehmainen, Nikolay Blagoev, John Donaghy, Oğuzhan Ersoy, Christopher Nies
- Abstract summary: Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators. NoLoCo implicitly synchronizes model weights via a novel variant of Nesterov momentum by partially averaging model weights with those of a randomly selected other replica. Our method requires significantly less communication overhead than fully sharded data parallel training or even the widely used low-communication training method DiLoCo.
- Score: 0.310688583550805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators, communicating over a high-bandwidth interconnect. Scaling up these clusters is expensive and can become impractical, imposing limits on the size of models that can be trained. Several recent studies have proposed training methods that are less communication-intensive, avoiding the need for a highly connected compute cluster. These state-of-the-art low-communication training methods still employ a synchronization step for model parameters which, when performed over all model replicas, can become costly on a low-bandwidth network. In this work, we propose a novel optimization method, NoLoCo, that does not explicitly synchronize all model parameters during training and, as a result, does not require any collective communication. NoLoCo implicitly synchronizes model weights via a novel variant of the Nesterov momentum optimizer by partially averaging model weights with those of a randomly selected other replica. We provide a theoretical convergence analysis for our proposed optimizer as well as empirical results from language model training. We benchmark NoLoCo on a wide range of accelerator counts and model sizes, from 125M to 6.8B parameters. Our method requires significantly less communication overhead than fully sharded data parallel training or even the widely used low-communication training method DiLoCo. The synchronization step itself is estimated to be an order of magnitude faster than the all-reduce used in DiLoCo for a few hundred accelerators training over the internet. Our method also has no global blocking communication, which reduces accelerator idle time. Compared to DiLoCo, we also observe up to $4\%$ faster convergence across a wide range of model sizes and accelerator counts.
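To make the mechanism in the abstract concrete, here is a minimal, self-contained sketch of gossip-style partial weight averaging combined with a Nesterov-style outer momentum step, simulated for several workers in a single process. It illustrates the general idea only, not the authors' exact NoLoCo update rule; the names and hyperparameters (`local_step`, `mix_ratio`, `outer_lr`, `mu`) are assumptions made for this sketch.

```python
# Illustrative sketch only: gossip-style partial weight averaging with a
# Nesterov-style outer momentum, simulated for several workers in one process.
# mix_ratio, outer_lr, mu, and local_step are assumptions, not NoLoCo's exact rule.
import numpy as np

rng = np.random.default_rng(0)

num_workers = 8
dim = 16
mix_ratio = 0.5      # how far each worker moves toward its partner's weights
outer_lr = 0.7       # outer step size applied to the averaging "pseudo-gradient"
mu = 0.9             # Nesterov-style outer momentum coefficient

weights = [rng.normal(size=dim) for _ in range(num_workers)]
momenta = [np.zeros(dim) for _ in range(num_workers)]

def local_step(w):
    """Stand-in for a few local optimizer steps on worker-specific data
    (here: gradient of a simple quadratic toward the origin, plus noise)."""
    grad = w + 0.1 * rng.normal(size=w.shape)
    return w - 0.05 * grad

for outer_iter in range(100):
    # Inner phase: every worker trains locally, no communication.
    weights = [local_step(w) for w in weights]

    # Outer phase: random pairing; each pair exchanges weights point-to-point.
    perm = rng.permutation(num_workers)
    for i in range(0, num_workers, 2):
        a, b = perm[i], perm[i + 1]
        # Pseudo-gradient pointing from each worker toward its partner's weights.
        delta_a = mix_ratio * (weights[b] - weights[a])
        delta_b = mix_ratio * (weights[a] - weights[b])
        for idx, delta in ((a, delta_a), (b, delta_b)):
            momenta[idx] = mu * momenta[idx] + delta
            # Nesterov-style step: momentum plus a look-ahead correction.
            weights[idx] = weights[idx] + outer_lr * (mu * momenta[idx] + delta)

spread = np.mean([np.linalg.norm(w - np.mean(weights, axis=0)) for w in weights])
print(f"average distance to the mean model after training: {spread:.4f}")
```

Because each synchronization involves only a randomly chosen pair of workers exchanging weights point-to-point, no all-reduce or other collective operation is needed, which is the property the abstract emphasizes.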
Related papers
- DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster [7.597885871452736]
We propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. We show that DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence.
arXiv Detail & Related papers (2025-06-26T13:45:04Z) - Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism [59.79227116582264]
Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation.
arXiv Detail & Related papers (2025-06-02T02:19:22Z) - AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - DeMo: Decoupled Momentum Optimization [6.169574689318864]
Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects. We introduce Decoupled Momentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude. Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW.
arXiv Detail & Related papers (2024-11-29T17:31:47Z) - No Need to Talk: Asynchronous Mixture of Language Models [25.3581396758015]
SMALLTALK LM is an innovative method for training a mixture of language models in an almost asynchronous manner. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. Experiments on language modeling demonstrate that SMALLTALK LM achieves significantly lower perplexity than dense model baselines.
arXiv Detail & Related papers (2024-10-04T15:50:10Z) - Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z) - DiLoCo: Distributed Low-Communication Training of Language Models [32.15083548875492]
Large language models (LLMs) have been a critical component in many applications of machine learning.
Standard approaches to training LLMs require a large number of interconnected accelerators.
We propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected.
arXiv Detail & Related papers (2023-11-14T12:05:45Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.