DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
- URL: http://arxiv.org/abs/2510.10620v1
- Date: Sun, 12 Oct 2025 14:01:32 GMT
- Title: DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
- Authors: Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, Chuan Wu
- Abstract summary: DCP is a dynamic context parallel training framework that introduces fine-grained blockwise partitioning of both data and computation. DCP accelerates attention by 1.19x~2.45x under causal masks and 2.15x~3.77x under sparse attention patterns.
- Score: 14.218532777707091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context parallelism has emerged as a key technique to support long-context training, a growing trend in generative AI for modern large models. However, existing context parallel methods rely on static parallelization configurations that overlook the dynamic nature of training data, specifically, the variability in sequence lengths and token relationships (i.e., attention patterns) across samples. As a result, these methods often suffer from unnecessary communication overhead and imbalanced computation. In this paper, we present DCP, a dynamic context parallel training framework that introduces fine-grained blockwise partitioning of both data and computation. By enabling flexible mapping of data and computation blocks to devices, DCP can adapt to varying sequence characteristics, effectively reducing communication and improving memory and computation balance. Micro-benchmarks demonstrate that DCP accelerates attention by 1.19x~2.45x under causal masks and 2.15x~3.77x under sparse attention patterns. Additionally, we observe up to 0.94x~1.16x end-to-end training speed-up for causal masks, and 1.00x~1.46x for sparse masks.
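To make the blockwise idea concrete, the sketch below partitions variable-length sequences into fixed-size token blocks, enumerates the attention computation blocks that a causal mask keeps, and greedily maps them to devices to balance per-device work. This is a minimal illustration of the general approach the abstract describes, not DCP's actual algorithm; the block size, the cost model, and the greedy heuristic are assumptions made for the example.

BLOCK = 1024  # tokens per data block (illustrative choice)

def causal_work_blocks(seq_lens, block=BLOCK):
    """Yield (seq_id, q_block, kv_block, cost) for block pairs kept by a causal mask."""
    for sid, n in enumerate(seq_lens):
        nblk = (n + block - 1) // block          # ceil-divide each sequence into blocks
        for q in range(nblk):
            for kv in range(q + 1):              # causal mask: only kv_block <= q_block survives
                # crude cost model: diagonal blocks are roughly half masked out
                cost = block * block // 2 if kv == q else block * block
                yield sid, q, kv, cost

def assign_to_devices(seq_lens, num_devices):
    """Greedy longest-processing-time mapping of computation blocks to devices."""
    loads = [0] * num_devices
    placement = {}
    for sid, q, kv, cost in sorted(causal_work_blocks(seq_lens), key=lambda b: -b[3]):
        d = min(range(num_devices), key=loads.__getitem__)   # pick the least-loaded device
        placement[(sid, q, kv)] = d
        loads[d] += cost
    return placement, loads

if __name__ == "__main__":
    placement, loads = assign_to_devices(seq_lens=[131072, 8192, 2048], num_devices=8)
    print("per-device attention cost:", loads)

A real system would also have to weigh communication, i.e., which key/value blocks must move between devices for a given placement; the sketch balances compute only and ignores that term.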
Related papers
- ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution [13.109726609738749]
ParEVO is a framework designed to synthesize high-performance parallel algorithms for irregular data. On the ParEval benchmark, ParEVO achieves an average 106x speedup, and a robust 13.6x speedup on complex irregular graph problems.
arXiv Detail & Related papers (2026-03-03T01:41:07Z) - DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning [6.3691159627915015]
We introduce DART, a lightweight training-free method that performs on-the-fly context-based pruning. DART monitors shifts in distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. It achieves accuracy gains of up to 14.5% on LLAMA-3.1-8B at 70% FFN sparsity and 3x better ROUGE-L scores with respect to static-masked pruning.
arXiv Detail & Related papers (2026-01-30T06:48:16Z) - LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z) - Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8x speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z) - db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism [14.406306253079515]
Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation. We formalize a sparse imbalance ratio to quantify the workload imbalance that sparse attention induces across devices, and propose db-SP, a sparsity-aware sequence parallelism technique. We show that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods.
arXiv Detail & Related papers (2025-11-28T11:55:46Z) - MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training [23.925430484357975]
MTraining is a distributed methodology for training Large Language Models with ultra-long contexts. MTraining integrates a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. MTraining achieves up to a 6x higher training throughput while preserving model accuracy.
arXiv Detail & Related papers (2025-10-21T17:25:32Z) - ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs [34.477777651648914]
Large language models (LLMs) pose significant inference latency challenges due to their autoregressive decoding paradigm. We propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
arXiv Detail & Related papers (2025-08-12T12:35:55Z) - ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs [22.542224045868117]
We introduce ByteScale, an efficient framework for large-scale mixed training of long and short sequences. ByteScale is based on Hybrid Data Parallelism (HDP), which unifies inter- and intra-data partitioning with a dynamic mesh design. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
arXiv Detail & Related papers (2025-02-28T17:01:03Z) - ParallelComp: Parallel Long-Context Compressor for Length Extrapolation [51.68913021512016]
Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs). In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck. We achieve a 1.76x improvement in chunk throughput, thereby achieving a 23.50x acceleration in the prefill stage with negligible performance loss.
arXiv Detail & Related papers (2025-02-20T07:10:43Z) - DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training [85.04885553561164]
Diffusion Transformers (DiTs) have shown remarkable performance in generating high-quality videos. Attention in DiTs can consume up to 95% of processing time and demands specialized context parallelism. This paper introduces DSV to accelerate video DiT training by leveraging the dynamic attention sparsity we empirically observe.
arXiv Detail & Related papers (2025-02-11T14:39:59Z) - Parallel Sequence Modeling via Generalized Spatial Propagation Network [80.66202109995726]
Generalized Spatial Propagation Network (GSPN) is a new attention mechanism for optimized vision tasks that inherently captures 2D spatial structures. GSPN overcomes limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation.
arXiv Detail & Related papers (2025-01-21T18:56:19Z) - GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism [10.723541176359452]
Communication is a key bottleneck for distributed graph neural network (GNN) training.
GNNPipe is a new approach that scales distributed full-graph deep GNN training.
arXiv Detail & Related papers (2023-08-19T18:44:14Z) - Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z) - Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average 1.52x speedup across six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.