Communication Efficient LLM Pre-training with SparseLoCo
- URL: http://arxiv.org/abs/2508.15706v2
- Date: Wed, 05 Nov 2025 21:17:49 GMT
- Title: Communication Efficient LLM Pre-training with SparseLoCo
- Authors: Amir Sarfi, Benjamin Thérien, Joel Lidin, Eugene Belilovsky
- Abstract summary: We introduce SparseLoCo, a communication-efficient training algorithm for Large Language Models (LLMs). SparseLoCo effectively leverages error feedback with Top-k sparsification and 2-bit quantization to reach extreme sparsity as low as 1-3%. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
- Score: 13.326450941764099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across datacenters and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients, resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization is often applied to reduce the pseudo-gradient's size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have achieved only limited quantization. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages error feedback with Top-k sparsification and 2-bit quantization to reach extreme sparsity as low as 1-3% while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by an error feedback accumulator combined with aggressive sparsity, and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
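As a rough illustration of the mechanism the abstract describes, the sketch below combines an error-feedback accumulator with Top-k sparsification and a simple 2-bit quantizer. The function name, the uniform quantizer, and the 2% default are illustrative assumptions, not the paper's code.

```python
import torch

def compress_pseudo_grad(pseudo_grad, error_buf, k_frac=0.02):
    """Error feedback + Top-k + 2-bit quantization of one worker's pseudo-gradient.

    Illustrative sketch only; the exact quantizer and defaults are assumptions,
    not the authors' implementation.
    """
    # Fold in the residual that previous rounds failed to transmit
    # (this accumulator is what locally approximates outer momentum).
    acc = (error_buf + pseudo_grad).flatten()

    # Top-k sparsification: keep only the largest-magnitude 1-3% of entries.
    k = max(1, int(k_frac * acc.numel()))
    idx = acc.abs().topk(k).indices
    vals = acc[idx]

    # 2-bit quantization of the surviving values: 4 uniform levels in [-1, 1].
    scale = vals.abs().max().clamp(min=1e-12)
    q = torch.round((vals / scale + 1.0) / 2.0 * 3.0)   # integers in {0, 1, 2, 3}
    deq = (q / 3.0 * 2.0 - 1.0) * scale                 # dequantized values

    # Only (idx, q, scale) would actually be communicated; build the dense view here.
    sent = torch.zeros_like(acc)
    sent[idx] = deq

    # Error feedback: whatever compression dropped stays in the buffer.
    new_error_buf = (acc - sent).view_as(pseudo_grad)
    return sent.view_as(pseudo_grad), new_error_buf
```

In a DiLoCo-style outer loop, `pseudo_grad` would be the difference between the globally synchronized parameters and a worker's parameters after its inner steps, and the sparse payloads from all workers would be aggregated before the outer update.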
Related papers
- Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks [28.49324627841803]
Federated learning (FL) for large language models (LLMs) offers a privacy-preserving scheme, enabling clients to collaboratively fine-tune locally deployed LLMs or smaller language models (SLMs) without exchanging raw data. While parameter-sharing methods in traditional FL solve a number of technical challenges, they still incur high communication overhead. We propose a federated distillation framework for mutual knowledge transfer via shared logits. We show that our scheme achieves superior performance compared to baseline methods while effectively reducing communication overhead by approximately 50% (a minimal sketch of logit-based transfer follows this entry).
arXiv Detail & Related papers (2025-09-01T20:10:01Z)
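The entry above replaces parameter exchange with exchanged logits. A hedged sketch of what a logit-based transfer loss can look like; the function name, temperature, and aggregation rule are assumptions, not the paper's exact formulation:

```python
import torch.nn.functional as F

def distillation_loss(client_logits, aggregated_logits, T=2.0):
    """KL divergence pulling a client's logits toward server-aggregated logits."""
    teacher = F.softmax(aggregated_logits / T, dim=-1)
    student = F.log_softmax(client_logits / T, dim=-1)
    # The T**2 factor keeps gradient scale comparable across temperatures.
    return F.kl_div(student, teacher, reduction="batchmean") * T * T
```

Because only logits on shared data cross the network instead of full model updates, payload size scales with vocabulary and batch size rather than parameter count, which is the usual source of savings in federated distillation.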
- R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
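R-Sparse's rank-aware criterion is not detailed in the snippet above; as orientation, the sketch below shows only the simplest member of the activation-sparsity family it belongs to, plain magnitude Top-k per token:

```python
import torch

def topk_activations(x, keep_frac=0.5):
    """Keep the largest-magnitude fraction of activations per token, zero the rest.

    Plain magnitude Top-k for illustration; R-Sparse's rank-aware selection
    is more sophisticated than this.
    """
    k = max(1, int(keep_frac * x.shape[-1]))
    # Threshold = k-th largest |activation| along the feature dimension.
    thresh = x.abs().topk(k, dim=-1).values[..., -1:]
    return torch.where(x.abs() >= thresh, x, torch.zeros_like(x))
```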
- ProFe: Communication-Efficient Decentralized Federated Learning via Distillation and Prototypes [3.7340128675975173]
Decentralized Federated Learning (DFL) trains models in a collaborative and privacy-preserving manner. This paper introduces ProFe, a novel communication optimization algorithm for DFL that combines knowledge distillation, prototype learning, and quantization techniques.
arXiv Detail & Related papers (2024-12-15T14:49:29Z)
- CELLM: An Efficient Communication in Large Language Models Training for Federated Learning [0.0]
This thesis aims to develop efficient training methods for large language models (LLMs) in Federated Learning (FL).
First, we use low-rank adaptation (LoRA) to reduce the computational load of local model training.
Second, we communicate sparse updates throughout training to significantly cut down on communication costs.
arXiv Detail & Related papers (2024-07-30T05:24:08Z)
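Low-rank adaptation, which the entry above builds on, trains a small low-rank correction while the base weights stay frozen, so only the two small factors need to be trained and communicated. A standard sketch, with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha/r) * B A x, with W frozen and only A, B trainable."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

Per round, a client then ships r * (d_in + d_out) numbers per adapted layer instead of d_in * d_out, and sparsifying these updates (the entry's second step) shrinks the payload further.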
- FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models [52.13056951033747]
Federated Learning (FL) has garnered increasing attention due to its unique characteristic of allowing heterogeneous clients to process their private data locally and interact with a central server. A critical bottleneck in FL is the communication cost. Our work is inspired by the innovative Scaffnew algorithm, which has advanced the reduction of communication complexity in FL. We introduce FedComLoc, integrating practical and effective compression into Scaffnew to further enhance communication efficiency.
arXiv Detail & Related papers (2024-03-14T22:29:59Z)
- LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression [56.01900711954956]
We introduce LoCoDL, a communication-efficient algorithm that leverages the two popular and effective techniques of Local training, which reduces the communication frequency, and Compression, in which short bitstreams are sent instead of full-dimensional vectors of floats. LoCoDL provably benefits from local training and compression and enjoys a doubly-accelerated communication complexity, with respect to the condition number of the functions and the model dimension, in the general heterogeneous regime with strongly convex functions.
arXiv Detail & Related papers (2024-03-07T09:22:50Z)
- Sparse Training for Federated Learning with Regularized Error Correction [9.852567834643292]
Federated Learning (FL) has attracted much interest due to the significant advantages it brings to training deep neural network (DNN) models.
FLARE presents a novel sparse training approach via accumulated pulling of the updated models with regularization on the embeddings in the FL process.
The performance of FLARE is validated through extensive experiments on diverse and complex models, achieving a remarkable sparsity level (10 times and more beyond the current state-of-the-art) along with significantly improved accuracy.
arXiv Detail & Related papers (2023-12-21T12:36:53Z)
- Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
- Fundamental Limits of Communication Efficiency for Model Aggregation in Distributed Learning: A Rate-Distortion Approach [54.311495894129585]
We study the limit of communication cost of model aggregation in distributed learning from a rate-distortion perspective.
It is found that the communication gain by exploiting the correlation between worker nodes is significant for SignSGD.
arXiv Detail & Related papers (2022-06-28T13:10:40Z)
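SignSGD, which the rate-distortion entry above analyzes, sends one bit per coordinate, and aggregation can be a majority vote over workers' signs. A minimal sketch (function names are illustrative):

```python
import torch

def sign_compress(grad):
    # 1 bit per coordinate in principle; sign() returns values in {-1, 0, +1}.
    return torch.sign(grad)

def majority_vote(worker_signs):
    # Element-wise majority vote across workers' sign vectors.
    return torch.sign(torch.stack(worker_signs).sum(dim=0))
```

The entry's finding is that, because workers' gradients are correlated, the information that actually needs to cross the network is below one bit per coordinate, so further gains are possible in principle.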
- Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
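As context for the AdaFL entry above, the sketch below shows a standard unbiased stochastic uniform quantizer; an adaptive strategy in AdaFL's spirit would vary `bits` over training (coarse early, fine late) to balance bandwidth against the error floor. The quantizer itself is a generic QSGD-style construction, not AdaFL's exact scheme.

```python
import torch

def stochastic_quantize(update, bits=4):
    """Unbiased stochastic uniform quantization of a model update."""
    levels = 2 ** bits - 1
    scale = update.abs().max().clamp(min=1e-12)
    y = (update / scale + 1.0) / 2.0 * levels   # map [-scale, scale] -> [0, levels]
    low = torch.floor(y)
    q = low + torch.bernoulli(y - low)          # randomized rounding => unbiased
    return (q / levels * 2.0 - 1.0) * scale     # dequantize
```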
- CosSGD: Nonlinear Quantization for Communication-efficient Federated Learning [62.65937719264881]
Federated learning facilitates learning across clients without transferring local data on these clients to a central server.
We propose a nonlinear quantization for compressed gradient descent, which can be easily utilized in federated learning.
Our system significantly reduces the communication cost by up to three orders of magnitude, while maintaining convergence and accuracy of the training process.
arXiv Detail & Related papers (2020-12-15T12:20:28Z)
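CosSGD's specific nonlinear mapping is not reproduced here; the sketch below only illustrates the general companding idea behind nonlinear quantization: warp values so quantization levels are denser where gradient values concentrate (near zero), quantize uniformly, then invert the warp. The square-root map is an assumption for illustration.

```python
import torch

def companded_quantize(grad, bits=4):
    """Nonlinear quantization via companding: warp, quantize uniformly, unwarp."""
    levels = 2 ** bits - 1
    scale = grad.abs().max().clamp(min=1e-12)
    x = grad / scale                               # normalize to [-1, 1]
    y = torch.sign(x) * x.abs().sqrt()             # sqrt warp: finer resolution near 0
    q = torch.round((y + 1.0) / 2.0 * levels)      # uniform levels on the warped axis
    y_hat = q / levels * 2.0 - 1.0
    return torch.sign(y_hat) * y_hat ** 2 * scale  # invert warp, rescale
```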
This list is automatically generated from the titles and abstracts of the papers on this site.