MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
- URL: http://arxiv.org/abs/2510.18830v1
- Date: Tue, 21 Oct 2025 17:25:32 GMT
- Title: MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
- Authors: Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu,
- Abstract summary: MTraining is a distributed methodology for training Large Language Models with ultra-long contexts. MTraining integrates a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. MTraining achieves up to a 6x higher training throughput while preserving model accuracy.
- Score: 23.925430484357975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.
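The dynamic block-sparse attention idea the abstract describes can be illustrated with a toy kernel: score whole key blocks coarsely, keep only the highest-scoring blocks for each query block, and attend densely inside the survivors. This is a minimal NumPy sketch under my own assumptions, not the MTraining implementation (which additionally balances the selected blocks across ring-attention workers):

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4, keep_ratio=0.5):
    """Toy dynamic block-sparse attention (illustrative only):
    estimate block importance from mean-pooled queries/keys, keep
    the top fraction of key blocks per query block, then run dense
    softmax attention over the surviving blocks."""
    n, d = q.shape
    nb = n // block
    # Mean-pool queries and keys per block for a cheap importance estimate.
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    est = qb @ kb.T                              # (nb, nb) coarse block scores
    k_keep = max(1, int(keep_ratio * nb))
    keep = np.argsort(-est, axis=1)[:, :k_keep]  # kept key blocks per query block

    out = np.zeros_like(q)
    for i in range(nb):
        cols = np.concatenate([np.arange(j * block, (j + 1) * block)
                               for j in keep[i]])
        rows = np.arange(i * block, (i + 1) * block)
        s = q[rows] @ k[cols].T / np.sqrt(d)
        p = np.exp(s - s.max(axis=1, keepdims=True))  # stable softmax
        p /= p.sum(axis=1, keepdims=True)
        out[rows] = p @ v[cols]
    return out
```

With `keep_ratio=1.0` every block survives and the result matches dense attention; smaller ratios trade accuracy for a proportional reduction in attention FLOPs.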
Related papers
- Predicting Task Performance with Context-aware Scaling Laws [56.6850444554434]
We propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases.
arXiv Detail & Related papers (2025-10-16T17:35:18Z) - Trainable Dynamic Mask Sparse Attention [11.506985057671015]
We introduce a trainable dynamic mask sparse attention mechanism, a method that merges the advantages of both position-aware and content-aware approaches. We demonstrate that the introduced dynamic mask and sparse weights do not obstruct gradients, supporting end-to-end training.
arXiv Detail & Related papers (2025-08-04T07:05:15Z) - DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding [26.39397960987363]
We propose a simple modification to pretrained transformer models. Instead of concatenation with the language prompt at the start, we insert multimodal tokens directly into the middle. Our results indicate that our method reduces computational costs during both training and inference.
arXiv Detail & Related papers (2025-04-27T18:56:26Z) - From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models [54.44375226381814]
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling. We introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach achieves state-of-the-art performance across a diverse set of long-context benchmarks.
arXiv Detail & Related papers (2025-04-08T16:58:58Z) - PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by 5% to 40%.
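The "exponentially scaling receptive field" idea can be sketched as a static mask in which each token attends to keys at power-of-two distances, so each row has only O(log n) nonzeros and stacked layers cover the full context. This is a hypothetical illustration of the general pattern, not PowerAttention's actual design:

```python
import numpy as np

def power_mask(n):
    """Boolean causal mask where each query attends to itself and to
    keys at power-of-two distances (1, 2, 4, ...). Each row holds
    roughly log2(i) + 1 nonzeros, for O(n log n) total attention cost."""
    m = np.zeros((n, n), dtype=bool)
    for i in range(n):
        m[i, i] = True
        d = 1
        while i - d >= 0:
            m[i, i - d] = True  # attend to the key d positions back
            d *= 2
    return m
```

Because distances compose across layers, information from any position can reach any later position through a logarithmic number of hops, which is what makes such exponential patterns competitive with dense attention at a fraction of the cost.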
arXiv Detail & Related papers (2025-03-05T15:24:11Z) - InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs). We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
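Hierarchical token pruning of the kind summarized above can be illustrated with a two-level filter: score whole key chunks against the current query, keep the best chunks, then keep the top-scoring individual tokens inside the survivors. This is a toy sketch of the general idea, not the actual InfiniteHiP algorithm:

```python
import numpy as np

def hierarchical_prune(query, keys, chunk=4, keep_chunks=2, keep_tokens=4):
    """Toy two-level token pruning (illustrative only).
    Level 1: coarse chunk scores from mean-pooled keys.
    Level 2: exact token scores inside the surviving chunks.
    Returns the sorted indices of the kept key tokens."""
    n, d = keys.shape
    nb = n // chunk
    kb = keys.reshape(nb, chunk, d).mean(axis=1)      # chunk summaries
    top_chunks = np.argsort(-(kb @ query))[:keep_chunks]
    cand = np.concatenate([np.arange(c * chunk, (c + 1) * chunk)
                           for c in top_chunks])       # candidate tokens
    tok_scores = keys[cand] @ query
    kept = cand[np.argsort(-tok_scores)[:keep_tokens]]
    return np.sort(kept)
```

The coarse pass touches only n/chunk summary vectors, so most of the context is discarded before any per-token work, which is where the decoding speedup comes from.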
arXiv Detail & Related papers (2025-02-13T02:52:01Z) - Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models [21.90388980448712]
Training models to handle long contexts presents significant challenges.
We introduce Untie the Knots (UtK), a novel data augmentation strategy employed during the continued pre-training phase.
We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accuracy on RULER at 128K context length.
arXiv Detail & Related papers (2024-09-07T09:28:55Z) - LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models [72.71150585370147]
LongRecipe is an efficient training strategy for extending the context window of large language models.
It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies.
LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full-sequence training.
arXiv Detail & Related papers (2024-08-31T17:19:30Z) - KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z) - Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
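One way to picture the chunk-based decomposition is through the relative position indices it feeds the attention scores: intra-chunk pairs keep their true distance, while inter-chunk pairs reuse a bounded distance, so no index ever exceeds the range seen during training. This is a simplified sketch of that intuition; DCA's actual scheme uses separate intra-chunk, inter-chunk, and successive-chunk components:

```python
import numpy as np

def dual_chunk_positions(n, chunk=4):
    """Sketch of bounded relative positions for chunked attention
    (illustrative only, not DCA's exact index scheme): pairs inside
    the same chunk use their true distance; pairs in different
    chunks use a distance clipped at the chunk size."""
    rel = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1):          # causal: only j <= i
            d = i - j
            if i // chunk == j // chunk:
                rel[i, j] = d            # intra-chunk: exact distance
            else:
                rel[i, j] = min(d, chunk)  # inter-chunk: bounded distance
    return rel
```

Because every index stays within the trained range regardless of sequence length, the rotary position embeddings never see out-of-distribution distances, which is what lets the context window grow without continual training.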
arXiv Detail & Related papers (2024-02-27T12:39:23Z) - Towards Structured Dynamic Sparse Pre-Training of BERT [4.567122178196833]
We develop and study a straightforward, dynamic always-sparse pre-training approach for the BERT language modeling task.
We demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.
arXiv Detail & Related papers (2021-08-13T14:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.