Heterogeneous Low-Bandwidth Pre-Training of LLMs
- URL: http://arxiv.org/abs/2601.02360v1
- Date: Mon, 05 Jan 2026 18:59:57 GMT
- Title: Heterogeneous Low-Bandwidth Pre-Training of LLMs
- Authors: Yazan Obeidi, Amir Sarfi, Joel Lidin, Paul Janson, Eugene Belilovsky
- Abstract summary: We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica. We find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff.
- Score: 14.653627043173715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters, especially when model parallelism forces frequent, large inter-device communications. We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study a number of adaptations. Across large-scale language modeling experiments (178M-1B parameters) on standard pretraining corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff relative to compressing all replicas, especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.
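The core communication-saving idea of a SparseLoCo-style outer step (infrequent synchronization plus top-k sparse pseudo-gradient exchange) can be sketched in a few lines. This is a minimal NumPy illustration under assumed semantics, not the authors' implementation; `topk_sparsify` and `outer_step` are invented names, and the all-reduce is modeled as a simple average:

```python
import numpy as np

def topk_sparsify(delta, k):
    """Keep only the k largest-magnitude entries of a pseudo-gradient."""
    flat = delta.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(delta.shape)

def outer_step(local_params, global_params, k):
    """One low-communication outer step: each replica exchanges only a
    sparse pseudo-gradient (its drift from the last synchronized point)."""
    pseudo_grads = [topk_sparsify(p - global_params, k) for p in local_params]
    # Stand-in for the sparse all-reduce: average the sparse updates.
    avg = np.mean(pseudo_grads, axis=0)
    return global_params + avg

# Toy usage: 4 replicas, 10-dim parameters, each sends only its top-3 entries.
rng = np.random.default_rng(0)
g = np.zeros(10)
replicas = [g + rng.normal(size=10) for _ in range(4)]
g_new = outer_step(replicas, g, k=3)
```

Between outer steps each replica would take many local optimizer steps, so the sparse exchange happens only infrequently; the bandwidth saving comes from sending k entries per replica instead of the full parameter dimension.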
Related papers
- SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training [54.8494905524997]
Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. We propose SENTINEL, a verification mechanism for pipeline parallelism (PP) training without duplication. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
arXiv Detail & Related papers (2026-03-03T23:51:10Z) - AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism [54.8494905524997]
We introduce asynchronous updates across both parallelism axes, relaxing the co-location requirement. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models demonstrate that our approach matches the performance of the fully synchronous baseline.
arXiv Detail & Related papers (2026-01-30T01:24:47Z) - TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training [9.859893936091813]
Training large language models (LLMs) is fundamentally constrained by limited device memory and costly inter-device communication. We propose TawPipe, which exploits hierarchical bandwidth in distributed clusters for improved communication efficiency.
arXiv Detail & Related papers (2025-11-12T21:06:37Z) - Distributed Low-Communication Training with Decoupled Momentum Optimization [38.33322656231618]
Training large models requires substantial computational resources, typically available only in data centers with high-bandwidth interconnects. We propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with momentum gradient compression. In particular, we treat the momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform.
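The frequency decomposition described above can be sketched with an explicit orthonormal DCT-II basis. This is a hypothetical NumPy illustration of the general idea (split a momentum vector into low- and high-frequency parts), not the paper's implementation; `dct_matrix` and `split_momentum` are invented names, and the cutoff `n_low` is an assumed hyperparameter:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows = frequencies)."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # position index
    M = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def split_momentum(m, n_low):
    """Decompose a momentum vector into low- and high-frequency parts
    by zeroing DCT coefficients above the cutoff, then inverting."""
    D = dct_matrix(len(m))
    coeffs = D @ m
    low = D.T @ np.where(np.arange(len(m)) < n_low, coeffs, 0.0)
    high = m - low
    return low, high
```

Because the basis is orthonormal, the two components sum exactly back to the original momentum, so the split loses no information; a compression scheme could then transmit the two components at different rates.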
arXiv Detail & Related papers (2025-10-03T08:25:21Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems [6.453224262551299]
Large language models (LLMs) in recommendation systems have become increasingly prominent. This paper systematically investigates two classes of optimization methods: model parallelism and data parallelism. Experiments conducted on a real-world recommendation dataset in a simulated service environment demonstrate that our proposed hybrid parallelism scheme increases training throughput by over 30%.
arXiv Detail & Related papers (2025-06-21T02:37:25Z) - TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network [21.231881562816373]
We introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers.
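The Hadamard-then-quantize core of this idea can be sketched as follows. This is a minimal NumPy sketch of rotating a tile with an orthonormal Hadamard transform (which spreads outliers across entries) before per-tile uniform quantization; it omits TAH-Quant's entropy-guided bit allocation and pivot swapping, and `quantize_tile`/`dequantize_tile` are invented names:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_tile(tile, bits):
    """Rotate a square tile, then quantize uniformly with one scale per tile."""
    H = hadamard(tile.shape[0])
    rotated = H @ tile
    scale = np.abs(rotated).max() / (2 ** (bits - 1) - 1)
    q = np.round(rotated / scale).astype(np.int8)
    return q, scale

def dequantize_tile(q, scale, n):
    """Undo quantization and the orthonormal rotation."""
    H = hadamard(n)
    return H.T @ (q.astype(np.float64) * scale)
```

Because the rotation is orthonormal, it preserves the L2 norm of the quantization error, while flattening outliers lets the single per-tile scale stay small.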
arXiv Detail & Related papers (2025-06-02T06:13:41Z) - Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism [59.79227116582264]
Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation.
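Compressing both the forward activations and the backward activation gradients, as in the subspace-projected inter-stage communication the main paper builds on, can be illustrated with a low-rank projection sketch. This is a hypothetical NumPy illustration, not the Protocol Models implementation; `make_subspace`, `compress`, and `decompress` are invented names, and both pipeline stages are assumed to share the same basis `Q`:

```python
import numpy as np

def make_subspace(d, r, seed=0):
    """Random orthonormal basis for an r-dimensional subspace of R^d,
    shared by sender and receiver (same seed on both stages)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d, r)))
    return Q  # d x r, with Q.T @ Q = I_r

def compress(acts, Q):
    """Project activations into the subspace before sending them to the
    next pipeline stage: r floats per token instead of d."""
    return acts @ Q          # (batch, d) -> (batch, r)

def decompress(coded, Q):
    """Lift received coefficients back to the full activation space."""
    return coded @ Q.T       # (batch, r) -> (batch, d)
```

The inter-stage traffic shrinks by a factor of d/r; the same projection applied to activation gradients on the backward pass gives the forward-and-backward compression the abstract describes.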
arXiv Detail & Related papers (2025-06-02T02:19:22Z) - StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.