ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training
- URL: http://arxiv.org/abs/2406.02613v3
- Date: Tue, 14 Oct 2025 12:26:33 GMT
- Title: ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training
- Authors: Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, Edouard Oyallon
- Abstract summary: We propose a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
- Score: 22.940404796500985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose \textbf{AC}cumulate while \textbf{CO}mmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
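To make the overlap concrete, below is a minimal, hypothetical PyTorch-style sketch of the "accumulate while you communicate" pattern, not the authors' implementation: the reduction of the gradients accumulated at the previous step is launched asynchronously and overlapped with the backward pass that accumulates new gradients. Optimizer-state sharding and ACCO's delay-compensation technique are omitted, and all names (acco_style_step, loss_fn, batch) are placeholders.

```python
# Minimal sketch only: assumes torch.distributed is already initialized and
# every worker holds a full replica of the model. Not ACCO's actual code.
import torch
import torch.distributed as dist


def acco_style_step(model, optimizer, batch, loss_fn):
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) Move the gradients accumulated during the previous call into
    #    dedicated communication buffers and clear the accumulators.
    comm_bufs = []
    for p in params:
        g = p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
        comm_bufs.append(g)
        if p.grad is not None:
            p.grad.zero_()

    # 2) Launch the synchronization of these *delayed* gradients without
    #    blocking (whether it truly overlaps with compute depends on the
    #    backend and stream configuration).
    handles = [dist.all_reduce(g, async_op=True) for g in comm_bufs]

    # 3) Meanwhile, accumulate fresh gradients on the current micro-batch.
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # 4) Wait for the communication, then update the parameters using the
    #    delayed, globally averaged gradients. The fresh gradients stay in
    #    p.grad and are communicated at the next call.
    for h in handles:
        h.wait()
    world_size = dist.get_world_size()
    fresh_grads = [p.grad for p in params]
    for p, g in zip(params, comm_bufs):
        p.grad = g / world_size   # all_reduce sums, so average here
    optimizer.step()
    for p, g in zip(params, fresh_grads):
        p.grad = g
    return loss.detach()
```

In ACCO itself, the optimizer states are additionally sharded across workers (as in ZeRO-1) and a dedicated technique keeps the training dynamics aligned with standard distributed optimization despite the one-step gradient delay; the sketch above only illustrates the compute/communication overlap.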
Related papers
- AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism [54.8494905524997]
We introduce asynchronous updates across both parallelism axes, relaxing the co-location requirement. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models demonstrate that our approach matches the performance of the fully synchronous baseline.
arXiv Detail & Related papers (2026-01-30T01:24:47Z) - First Provably Optimal Asynchronous SGD for Homogeneous and Heterogeneous Data [0.0]
This dissertation develops a rigorous framework for asynchronous first-order optimization. We show that with proper design, asynchronous SGD can achieve optimal time complexity, matching guarantees previously known only for synchronous methods.
arXiv Detail & Related papers (2026-01-05T19:51:09Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs [24.96730768606278]
We present AReaL-Hex, a heterogeneity-aware asynchronous RL training system. It effectively schedules how to execute rollout generation and policy model training over heterogeneous GPUs. It delivers up to 1.50x higher training throughput and a 1.46x reduction in training cost.
arXiv Detail & Related papers (2025-11-02T04:17:30Z) - AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training [4.643969942380424]
We propose a novel asynchronous variant of ZeRO to achieve superior performance while maintaining simplicity and memory efficiency. Unlike traditional ZeRO, which employs over-fine-grained sharding that can lead to inefficient communication, AsyncHZP adaptively reshards parameters, gradients, and states across different replica groups. AsyncHZP consistently outperforms classic ND parallelism, achieving state-of-the-art performance without complex strategic tuning.
arXiv Detail & Related papers (2025-10-23T01:29:35Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models [34.54482364155804]
We propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch mode mechanism. MIT allows individual nodes to run multiple lightweight training streams with different model instances in parallel. Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance computation and communication. Switch mode seamlessly introduces gradient accumulation once adaptive batch sizes grow beyond hardware-friendly limits (a rough sketch of this switch follows).
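A minimal, hypothetical sketch of the switch-mode idea described above; the function name, signature, and the notion of a maximum per-device batch are illustrative assumptions, not AdLoCo's actual interface:

```python
import math

def plan_local_step(adaptive_batch_size: int, max_device_batch: int):
    """Split an adaptively grown batch into hardware-friendly micro-batches.

    Once the adaptive batch exceeds what the device can hold, fall back to
    gradient accumulation over several smaller micro-batches (illustrative).
    """
    accum_steps = max(1, math.ceil(adaptive_batch_size / max_device_batch))
    micro_batch = math.ceil(adaptive_batch_size / accum_steps)
    return accum_steps, micro_batch

# e.g. a batch of 4096 samples on a device that fits 512 at a time:
# plan_local_step(4096, 512) -> (8, 512), i.e. accumulate over 8 micro-batches.
```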
arXiv Detail & Related papers (2025-08-25T16:35:57Z) - Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z) - Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [8.80408909878008]
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters.
Existing methods suggest pipelining the communication in an MoE layer with the computation for overlapping.
We present COMET, an optimized MoE system with fine-grained communication-computation overlapping.
arXiv Detail & Related papers (2025-02-27T06:36:45Z) - Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present a $\textbf{MA-OSMA}$ algorithm to transfer the discrete submodular problem into a continuous optimization problem.
We also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution.
Our algorithms significantly improve the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z) - Split Federated Learning Over Heterogeneous Edge Devices: Algorithm and Optimization [7.013344179232109]
Split Learning (SL) is a promising collaborative machine learning approach, enabling resource-constrained devices to train models without sharing raw data.
Current SL algorithms face limitations in training efficiency and suffer from prolonged latency.
We propose the Heterogeneous Split Federated Learning framework, which allows resource-constrained clients to train their personalized client-side models in parallel.
arXiv Detail & Related papers (2024-11-21T07:46:01Z) - A Quadratic Synchronization Rule for Distributed Deep Learning [66.68264684667562]
This work proposes a theory-grounded method for determining the synchronization period $H$, named the Quadratic Synchronization Rule (QSR).
Experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies.
arXiv Detail & Related papers (2023-10-22T21:38:57Z) - Accelerating Distributed ML Training via Selective Synchronization [0.0]
\texttt{SelSync} is a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step.
Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
arXiv Detail & Related papers (2023-07-16T05:28:59Z) - $\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning [0.0]
We introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$.
Our theoretical analysis proves accelerated rates compared to previous asynchronous decentralized baselines.
We show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers.
arXiv Detail & Related papers (2023-06-14T06:52:07Z) - TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation [53.84175614198885]
In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server.
We propose TAMUNA, the first algorithm for distributed optimization that jointly leverages the two strategies of local training and compression and allows for partial participation.
arXiv Detail & Related papers (2023-02-20T08:37:44Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (\emph{i.e.}, SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ local computations and only $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ communications.
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in the existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z) - Implementation of Parallel Simplified Swarm Optimization in CUDA [2.322689362836168]
In optimization computing, swarm intelligence algorithms (SIAs) are well suited for parallelization.
This paper proposes a GPU-based Parallel Simplified Swarm Optimization (PSSO) on the CUDA platform, considering computational ability and versatility.
As the results show, the time complexity was successfully reduced by an order of magnitude of $N$, and the problem of resource preemption was avoided entirely.
arXiv Detail & Related papers (2021-10-01T00:15:45Z) - AsySQN: Faster Vertical Federated Learning Algorithms with Better Computation Resource Utilization [159.75564904944707]
We propose an asynchronous quasi-Newton (AsySQN) framework for vertical federated learning (VFL).
The proposed algorithms make descent steps scaled by approximate Hessian information without calculating the inverse Hessian matrix explicitly.
We show that the adopted asynchronous computation can make better use of the computation resource.
arXiv Detail & Related papers (2021-09-26T07:56:10Z) - Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - Straggler-aware Distributed Learning: Communication Computation Latency
Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z) - Communication Contention Aware Scheduling of Multiple Deep Learning
Training Jobs [17.45154289084637]
We establish a new DDL job scheduling framework which organizes DDL jobs as Directed Acyclic Graphs (DAGs).
We then propose an efficient algorithm, LWF-$\kappa$, to balance the GPU utilization and consolidate the allocated GPUs for each job.
We show that LWF-$\kappa$ achieves up to $1.59\times$ improvement over the classical first-fit algorithms.
arXiv Detail & Related papers (2020-02-24T07:50:56Z)