Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
- URL: http://arxiv.org/abs/2501.18512v1
- Date: Thu, 30 Jan 2025 17:23:50 GMT
- Title: Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
- Authors: Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham
- Abstract summary: Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time.
Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint.
We show experimentally that we can distribute the training of billion-scale models and reach similar quality as before.
- Score: 66.84195842685459
- License:
- Abstract: Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at every gradient step, all devices need to be co-located using low-latency, high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers occur only infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, because the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall-clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute the training of billion-scale models and reach similar quality as before, while reducing the required bandwidth by two orders of magnitude.
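As a rough illustration of the three modifications described in the abstract, the following is a minimal single-process sketch, not the authors' implementation: parameters are split into fragments that are synchronized one at a time on a round-robin schedule, each synchronized result is applied one outer step late (standing in for communication that overlaps with continued inner training), and the exchanged deltas are quantized. All names, sizes, and hyperparameters are illustrative assumptions, the "gradients" are random stand-ins, and the outer update is simplified to plain averaging rather than an outer optimizer.

```python
# Minimal single-process sketch (not the authors' code) of the three ideas above:
# (1) synchronize parameter fragments in sequence, (2) apply each synchronization
# one outer step late so communication can overlap with continued training, and
# (3) quantize the exchanged deltas. All values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
NUM_WORKERS, NUM_FRAGMENTS, DIM = 4, 8, 1024      # toy sizes
H = 50                                            # inner steps between syncs
params = rng.normal(size=(NUM_FRAGMENTS, DIM))    # shared "global" parameters
workers = [params.copy() for _ in range(NUM_WORKERS)]

def quantize(x, bits=4):
    """Simulate low-bit exchange: uniform quantize, then dequantize the delta."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def inner_steps(w, steps=H, lr=0.01):
    """Stand-in for local training: each worker drifts with pseudo-gradients."""
    for _ in range(steps):
        w -= lr * rng.normal(size=w.shape)
    return w

in_flight = None  # (fragment_id, averaged_delta) still being "communicated"
for outer_step in range(32):
    for i in range(NUM_WORKERS):
        workers[i] = inner_steps(workers[i])

    # (2) Apply the previous fragment's sync now, one outer step late, so its
    # communication could have overlapped with the inner steps above.
    if in_flight is not None:
        frag, avg_delta = in_flight
        params[frag] += avg_delta
        for w in workers:
            w[frag] = params[frag]

    # (1) Only one fragment is synchronized this outer step (round-robin).
    frag = outer_step % NUM_FRAGMENTS
    # (3) Each worker sends a quantized delta for that fragment only; the outer
    # update here is plain averaging, a simplification of DiLoCo's outer step.
    deltas = [quantize(w[frag] - params[frag]) for w in workers]
    in_flight = (frag, np.mean(deltas, axis=0))
print("done; fragment", frag, "is still in flight")
```

Because only one of the NUM_FRAGMENTS fragments is exchanged per outer step, the peak bandwidth of a synchronization is roughly 1/NUM_FRAGMENTS of a full exchange, and the one-step-late application is what allows that exchange to proceed while inner training continues.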
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Accelerating Distributed ML Training via Selective Synchronization [0.0]
SelSync is a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step.
Our system converges to the same or better accuracy than BSP while reducing training time by up to $14\times$.
arXiv Detail & Related papers (2023-07-16T05:28:59Z)
- DropCompute: simple and more robust distributed synchronous training via compute variance reduction [30.46681332866494]
We study a typical scenario in which workers are straggling due to variability in compute time.
We propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training.
arXiv Detail & Related papers (2023-06-18T16:55:31Z)
- $\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning [0.0]
We introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$.
Our theoretical analysis proves accelerated rates compared to previous asynchronous decentralized baselines.
We show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers.
arXiv Detail & Related papers (2023-06-14T06:52:07Z)
- Straggler-Resilient Distributed Machine Learning with Dynamic Backup Workers [9.919012793724628]
We propose a fully distributed algorithm to determine the number of backup workers for each worker.
Our algorithm achieves a linear speedup for convergence (i.e., convergence performance increases linearly with respect to the number of workers).
arXiv Detail & Related papers (2021-02-11T21:39:53Z)
- Faster Non-Convex Federated Learning via Global and Local Momentum [57.52663209739171]
FedGLOMO is the first (first-order) FL algorithm of its kind.
Our algorithm is provably optimal even with compressed communication between the clients and the server.
arXiv Detail & Related papers (2020-12-07T21:05:31Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
- Communication-Efficient Decentralized Learning with Sparsification and Adaptive Peer Selection [13.963329236804586]
We introduce a novel decentralized training algorithm with the following key features.
Each worker only needs to communicate with a single peer at each communication round with a highly compressed model.
Experimental results show that our algorithm significantly reduces the communication traffic and generally selects relatively high bandwidth peers.
arXiv Detail & Related papers (2020-02-22T12:31:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.