TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training
- URL: http://arxiv.org/abs/2511.09741v1
- Date: Fri, 14 Nov 2025 01:07:20 GMT
- Title: TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training
- Authors: Houming Wu, Ling Chen
- Abstract summary: Training large language models (LLMs) is fundamentally constrained by limited device memory and costly inter-device communication. We propose TawPipe, which exploits hierarchical bandwidth in distributed clusters for improved communication efficiency.
- Score: 9.859893936091813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large language models (LLMs) is fundamentally constrained by limited device memory and costly inter-device communication. Although pipeline parallelism alleviates memory pressure by partitioning models across devices, it incurs activation communication overhead that scales linearly with sequence length, limiting efficiency in long-context training. Recent weight-passing approaches (e.g., WeiPipe) mitigate this by transmitting model weights instead of activations, but suffer from redundant peer-to-peer (P2P) transfers and underutilized intra-node bandwidth. We propose TawPipe--topology-aware weight pipeline parallelism, which exploits hierarchical bandwidth in distributed clusters for improved communication efficiency. TawPipe: (i) groups devices based on topology to optimize intra-node collective and inter-node P2P communication; (ii) assigns each device a fixed shard of model weights and gradients, avoiding redundant transfers; and (iii) overlaps communication with computation to hide latency. Unlike global collective operations used in fully sharded data parallelism (FSDP), TawPipe confines most communication within node boundaries, significantly reducing cross-node traffic. Extensive experiments on up to 24 GPUs with LLaMA-style models show that TawPipe achieves superior throughput and scalability compared to state-of-the-art baselines.
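The abstract above outlines TawPipe's three mechanisms: topology-based device grouping, a fixed per-device shard of weights and gradients, and communication-computation overlap. The paper's implementation is not reproduced here, so the following is a minimal, hypothetical PyTorch-style sketch of a weight-passing pipeline step built on those ideas; names such as `gpus_per_node`, `WeightShardStage`, and `pipeline_microbatch_step` are illustrative, not from the paper.

```python
# Hypothetical sketch (not the authors' code) of a topology-aware
# weight-passing pipeline step, assuming torch.distributed is initialized.
import torch
import torch.distributed as dist


def build_topology_groups(gpus_per_node: int):
    """Group ranks by node so full weights are assembled with fast intra-node
    collectives, while only shards cross the slower inter-node links."""
    world = dist.get_world_size()
    rank = dist.get_rank()
    node_id = rank // gpus_per_node
    # Every rank must create every group, in the same order.
    intra_groups = [
        dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
        for n in range(world // gpus_per_node)
    ]
    return node_id, intra_groups[node_id]


class WeightShardStage:
    """Each rank permanently owns one shard of a stage's weights/gradients;
    weights are passed between stages instead of activations."""

    def __init__(self, shard: torch.Tensor, intra_group, gpus_per_node: int):
        self.shard = shard                  # fixed local weight shard
        self.intra_group = intra_group
        self.gpus_per_node = gpus_per_node

    def gather_weights(self) -> torch.Tensor:
        # Intra-node all-gather (NVLink/PCIe); no shard is fetched twice
        # from a remote node.
        parts = [torch.empty_like(self.shard) for _ in range(self.gpus_per_node)]
        dist.all_gather(parts, self.shard, group=self.intra_group)
        return torch.cat(parts, dim=0)

    def send_shard_to_peer(self, dst_rank: int):
        # Single async P2P send of the local shard to the peer node; the
        # matching irecv on the peer is omitted here for brevity.
        return dist.isend(self.shard, dst=dst_rank)


def pipeline_microbatch_step(stage: WeightShardStage, x: torch.Tensor,
                             peer_rank: int) -> torch.Tensor:
    # 1) start shipping this rank's shard across nodes (communication)
    req = stage.send_shard_to_peer(peer_rank)
    # 2) meanwhile assemble full weights locally and run the micro-batch
    #    (computation), hiding the P2P latency behind the matmul
    w = stage.gather_weights()
    y = x @ w.t()
    req.wait()
    return y
```

In this sketch the expensive full-weight reconstruction stays on the intra-node interconnect, and only one shard per node crosses the inter-node link per step, mirroring the abstract's claim that most communication is confined within node boundaries.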
Related papers
- Wireless Federated Multi-Task LLM Fine-Tuning via Sparse-and-Orthogonal LoRA [61.12136997430116]
Decentralized federated learning (DFL) based on low-rank adaptation (LoRA) enables mobile devices with multi-task datasets to collaboratively fine-tune a large language model (LLM) by exchanging locally updated parameters with a subset of neighboring devices via wireless connections for knowledge integration. However, directly aggregating parameters fine-tuned on heterogeneous datasets induces three primary issues across the DFL life-cycle: (i) catastrophic knowledge forgetting during the fine-tuning process, arising from conflicting update directions caused by data heterogeneity; (ii) inefficient communication and convergence during the model aggregation process; ...
arXiv Detail & Related papers (2026-02-24T02:45:32Z) - AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism [54.8494905524997]
We introduce asynchronous updates across both parallelism axes, relaxing the co-location requirement. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models demonstrate that our approach matches the performance of the fully synchronous baseline.
arXiv Detail & Related papers (2026-01-30T01:24:47Z) - Heterogeneous Low-Bandwidth Pre-Training of LLMs [14.653627043173715]
We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica. We find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff.
arXiv Detail & Related papers (2026-01-05T18:59:57Z) - GSPN-2: Efficient Parallel Sequence Modeling [101.33780567131716]
The Generalized Spatial Propagation Network (GSPN) replaces quadratic self-attention with a line-scan propagation scheme. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications.
arXiv Detail & Related papers (2025-11-28T07:26:45Z) - Taming the Tail: NoI Topology Synthesis for Mixed DL Workloads on Chiplet-Based Accelerators [8.493759841403682]
Heterogeneous chiplet-based systems improve scaling by disaggregating CPUs/GPUs and emerging technologies (HBM/DRAM). However, this on-package disaggregation introduces latency in the Network-on-Interposer (NoI).
arXiv Detail & Related papers (2025-10-28T06:36:44Z) - EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models [10.372430331898608]
Reinforcement learning (RL) has become a pivotal component of large language model (LLM) post-training. We present EARL, a scalable system for efficient agentic RL.
arXiv Detail & Related papers (2025-10-07T13:52:51Z) - CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z) - Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms [4.127488674019288]
Post-training for large language models co-locates trajectory sampling and policy optimisation on the same GPU cluster. We present Echo, an RL system that cleanly decouples these two phases across heterogeneous "inference" and "training" swarms.
arXiv Detail & Related papers (2025-08-07T13:37:04Z) - Model Parallelism With Subnetwork Data Parallelism [21.914077370806016]
Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains.
arXiv Detail & Related papers (2025-07-11T21:25:11Z) - TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network [21.231881562816373]
We introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers.
arXiv Detail & Related papers (2025-06-02T06:13:41Z) - Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z) - Parallel Successive Learning for Dynamic Distributed Model Training over Heterogeneous Wireless Networks [50.68446003616802]
Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices.
We develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions.
Our analysis sheds light on the notion of cold vs. warmed-up models, and model inertia in distributed machine learning.
arXiv Detail & Related papers (2022-02-07T05:11:01Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)