Related papers: Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models

URL: http://arxiv.org/abs/2601.15305v1
Date: Mon, 12 Jan 2026 20:33:39 GMT
Title: Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models
Authors: Alfred Shen, Aaron Shen
Abstract summary: Gated Sparse Attention (GSA) is an architecture that realizes the benefits of both sparse and gated attention. GSA incorporates a gated lightning indexer with sigmoid activations that produce bounded, interpretable selection scores.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The computational burden of attention in long-context language models has motivated two largely independent lines of work: sparse attention mechanisms that reduce complexity by attending to selected tokens, and gated attention variants that improve training stability while mitigating the attention sink phenomenon. We observe that these approaches address complementary weaknesses and propose Gated Sparse Attention (GSA), an architecture that realizes the benefits of both. GSA incorporates a gated lightning indexer with sigmoid activations that produce bounded, interpretable selection scores, an adaptive sparsity controller that modulates the number of attended tokens based on local uncertainty, and dual gating at the value and output stages. We establish theoretical foundations for the approach, including complexity analysis, expressiveness results, and convergence guarantees. In experiments with 1.7B parameter models trained on 400B tokens, GSA matches the efficiency of sparse-only baselines (12-16x speedup at 128K context) while achieving the quality gains associated with gated attention: perplexity improves from 6.03 to 5.70, RULER scores at 128K context nearly double, and attention to the first token, a proxy for attention sinks, drops from 47% to under 4%. Training stability improves markedly, with loss spikes reduced by 98%.
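
Illustrative sketch (not the authors' implementation): the abstract names three ingredients: sigmoid-bounded selection scores, attention restricted to selected tokens, and gating at the value and output stages. The toy single-query example below shows how such pieces could fit together in plain Python. Every name in it (gated_sparse_attention, value_gate_logits, out_gate_logit) is hypothetical. As simplifying assumptions, the indexer score reuses the query-key dot product, the gates are scalars, and the number of selected tokens k is fixed rather than set by an adaptive sparsity controller.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_sparse_attention(q, keys, values, k,
                           value_gate_logits=None, out_gate_logit=0.0):
    """Toy single-query sketch; all names and simplifications are assumptions."""
    n = len(keys)
    if value_gate_logits is None:
        value_gate_logits = [0.0] * n

    # 1. Indexer: sigmoid keeps every selection score strictly inside (0, 1).
    #    Assumption: the score reuses q . key instead of a separate indexer.
    scores = [sigmoid(dot(q, key)) for key in keys]

    # 2. Sparse selection: keep the k highest-scoring positions (fixed k here).
    selected = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    selected.sort()

    # 3. Softmax attention over the selected positions only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # 4. Value-stage gate: scale each selected value by a sigmoid gate.
    dim = len(values[0])
    mixed = [0.0] * dim
    for w, i in zip(weights, selected):
        g = sigmoid(value_gate_logits[i])
        for d in range(dim):
            mixed[d] += w * g * values[i][d]

    # 5. Output-stage gate: scale the aggregated vector by a second gate.
    out_gate = sigmoid(out_gate_logit)
    output = [out_gate * x for x in mixed]
    return output, selected, scores


q = [1.0, 0.0]
keys = [[2.0, 0.0], [-2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
output, selected, scores = gated_sparse_attention(q, keys, values, k=2)
print(selected)
```

Because the output gate can scale the result toward zero, a query is not forced to spread a full unit of softmax weight over the selected tokens. This is one common intuition for why gating is associated with reduced attention to sink tokens, though the sketch makes no claim about the paper's reported numbers.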

Related papers

Chain of Simulation: A Dual-Mode Reasoning Framework for Large Language Models with Dynamic Problem Routing [0.0]
Chain of Simulation (CoS) is a novel dual-mode reasoning framework that dynamically routes problems to specialized reasoning strategies. CoS employs three distinct reasoning modes: computational flow with self-consistency for mathematical problems, symbolic state tracking with representations for spatial reasoning, and hybrid fact-extraction for multi-hop inference.
arXiv Detail & Related papers (2026-02-02T21:44:01Z)
Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models [44.28116882776357]
We present Punctuation-aware Hybrid Sparse Attention (PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
arXiv Detail & Related papers (2026-01-06T08:47:16Z)
D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning [49.16227597771663]
D2Pruner is a framework that combines debiased importance with a structural pruning mechanism. It reduces FLOPs by 74.2% while retaining 99.2% of its original performance. It marks a significant advancement with up to 63.53% improvement over existing methods.
arXiv Detail & Related papers (2025-12-22T14:42:31Z)
Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8x speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning [73.10669391954801]
We present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%.
arXiv Detail & Related papers (2025-10-22T07:59:38Z)
ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism [10.913346263482786]
We introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning. Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68 percent relative improvement in the Pass at 1 metric.
arXiv Detail & Related papers (2025-08-15T09:49:14Z)
DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference [19.987309147268586]
We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference on resource-constrained edge devices. We evaluate our framework on the edge-device-friendly BitNet-b1.58-2B-4T model and Llama3.2-1B-Instruct model across diverse language tasks.
arXiv Detail & Related papers (2025-07-25T18:23:18Z)
ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning [57.67273340380651]
Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks. These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
arXiv Detail & Related papers (2025-07-03T14:29:43Z)
Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
arXiv Detail & Related papers (2025-04-21T22:29:02Z)
Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference. This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.