Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models
- URL: http://arxiv.org/abs/2601.02819v1
- Date: Tue, 06 Jan 2026 08:47:16 GMT
- Title: Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models
- Authors: Junxiang Qiu, Shuo Wang, Zhengsu Chen, Hengheng Zhang, Jinda Lu, Changcheng Li, Qi Tian
- Abstract summary: We present Punctuation-aware Hybrid Sparse Attention (PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
- Score: 44.28116882776357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention has attracted increasing interest as a scalable alternative. However, existing sparse attention methods rely on coarse-grained semantic representations during block selection, which blur intra-block semantic boundaries and lead to the loss of critical information. To address this issue, we propose Punctuation-aware Hybrid Sparse Attention (PHSA), a natively trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, (1) we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead; (2) we introduce an extreme-sparsity-adaptive training and inference strategy that stabilizes model behavior under very low token activation ratios. Extensive experiments on general benchmarks and long-context evaluations demonstrate that PHSA consistently outperforms dense attention and state-of-the-art sparse attention baselines, including InfLLM v2. Specifically, for the 0.6B-parameter model with 32k-token input sequences, PHSA reduces information loss by 10.8% at a sparsity ratio of 97.3%.
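The dual-branch block selection described in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: every function name, the mean-pooling choice for both branches, and the fusion weight `alpha` are assumptions. Each key block gets a global mean-pooled representation fused with a mean over its punctuation-token positions, and a query attends only to the top-k scoring blocks.

```python
# Hypothetical sketch of PHSA-style dual-branch block selection
# (assumed names and pooling choices; not the paper's code).
import numpy as np

def select_blocks(q, keys, punct_mask, block_size=4, top_k=2, alpha=0.5):
    """Return indices of the top_k key blocks for query vector q.

    q:          (d,) query vector
    keys:       (n, d) key vectors
    punct_mask: (n,) bool, True where the token is punctuation
    alpha:      assumed fusion weight between the two branches
    """
    n, d = keys.shape
    n_blocks = n // block_size
    reps = np.empty((n_blocks, d))
    for b in range(n_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        msk = punct_mask[b * block_size:(b + 1) * block_size]
        global_rep = blk.mean(axis=0)                   # global branch
        # punctuation-enhanced boundary branch; falls back to the
        # global branch when a block contains no punctuation at all
        punct_rep = blk[msk].mean(axis=0) if msk.any() else global_rep
        reps[b] = alpha * global_rep + (1 - alpha) * punct_rep
    scores = reps @ q
    return np.argsort(scores)[-top_k:][::-1]  # best blocks first
```

Because block representations are computed once per sequence, the extra cost of the punctuation branch is a masked mean per block, which matches the abstract's claim of almost no additional overhead.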
Related papers
- Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees [0.0]
Large language models spend most of their inference cost on attention over long contexts. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors. We introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query.
arXiv Detail & Related papers (2026-02-14T14:29:10Z) - Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z) - D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning [49.16227597771663]
D2Pruner is a framework that combines debiased importance with a structural pruning mechanism. It reduces FLOPs by 74.2% while retaining 99.2% of its original performance. It marks a significant advancement with up to 63.53% improvement over existing methods.
arXiv Detail & Related papers (2025-12-22T14:42:31Z) - Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8× speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
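The idea of attending only to informative tokens can be sketched in a few lines. This is a generic top-k illustration in the spirit of the summary above, not TCA-Attention's published algorithm: the scoring rule, `keep_ratio`, and function name are all assumptions.

```python
# Generic sketch of training-free sparse attention over a kept subset
# of tokens (assumed scoring and ratio; not the paper's method).
import numpy as np

def sparse_attention(q, keys, values, keep_ratio=0.25):
    """Attend to only the top keep_ratio fraction of keys for query q."""
    scores = keys @ q / np.sqrt(q.shape[0])       # scaled dot-product
    k = max(1, int(len(scores) * keep_ratio))
    idx = np.argsort(scores)[-k:]                 # informative tokens only
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                                  # softmax over the kept set
    return w @ values[idx]
```

Dropping the unselected keys is also what shrinks the KV cache: only the kept (key, value) pairs need to stay resident during decoding.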
arXiv Detail & Related papers (2025-12-10T01:54:57Z) - STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models [11.965535230928372]
STORE is a unified and scalable token-based ranking framework built upon three core innovations. Our framework consistently improves prediction accuracy (online CTR by 2.71%, AUC by 1.195%) and training efficiency (1.84× throughput).
arXiv Detail & Related papers (2025-11-24T06:20:02Z) - Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition [41.77490816513839]
We propose a novel method for zero-shot skeleton action recognition, termed Flora. Specifically, we attune textual semantics by incorporating direction-aware regional semantics and a cross-modal consistency objective. Experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data.
arXiv Detail & Related papers (2025-11-12T14:54:53Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - SIM-CoT: Supervised Implicit Chain-of-Thought [108.30049193668083]
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models. We identify a core latent instability issue when scaling the computational budget of implicit CoT. We propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space.
arXiv Detail & Related papers (2025-09-24T17:01:32Z) - AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity [9.63873831179673]
Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase. We propose AnchorAttention, a difference-aware, dynamic sparse attention mechanism that efficiently identifies critical attention regions. With its finer-grained sparsity strategy, AnchorAttention achieves higher sparsity rates at the same recall level, significantly reducing computation time.
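"Difference-aware" selection can be illustrated with a small sketch. This is an assumption-laden toy, not AnchorAttention's published algorithm: the per-row thresholding rule and the `delta` margin are invented for illustration. An entry is kept when its pre-softmax score is within `delta` of that query row's maximum, so rows with a sharp peak keep few keys while flatter rows keep more.

```python
# Toy difference-aware sparsity mask (assumed thresholding rule;
# not AnchorAttention's actual stripe-selection procedure).
import numpy as np

def difference_mask(scores, delta=2.0):
    """scores: (n_q, n_k) pre-softmax attention logits.
    Keep entries whose logit is within delta of the row maximum."""
    row_max = scores.max(axis=1, keepdims=True)
    return scores >= (row_max - delta)
```

Entries dropped by the mask contribute at most exp(-delta) of the peak's softmax weight per key, which is why a margin-based rule can reach high sparsity with little recall loss.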
arXiv Detail & Related papers (2025-05-29T14:59:06Z) - Discrete Variational Attention Models for Language Generation [51.88612022940496]
We propose a discrete variational attention model with a categorical distribution over the attention mechanism, owing to the discrete nature of language.
Thanks to this discreteness, the training of our proposed approach does not suffer from posterior collapse.
arXiv Detail & Related papers (2020-04-21T05:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.