ProxyAttn: Guided Sparse Attention via Representative Heads
- URL: http://arxiv.org/abs/2509.24745v1
- Date: Mon, 29 Sep 2025 13:10:39 GMT
- Title: ProxyAttn: Guided Sparse Attention via Representative Heads
- Authors: Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che,
- Abstract summary: We propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation. We show that ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.
- Score: 59.03412871683236
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.
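The abstract's core mechanism can be illustrated in a minimal NumPy sketch: pool all heads into one representative proxy head, score key blocks cheaply against it, then keep a per-row budget of the highest-scoring blocks. Function names, shapes, and the `keep_ratio` budget rule here are illustrative assumptions, not the paper's actual GPU-kernel implementation.

```python
import numpy as np

def proxy_block_scores(Q, K, block_size=64):
    """Estimate per-block importance with a single pooled proxy head.

    Q, K: (num_heads, seq_len, head_dim). Mean-pooling across heads is
    the simplest stand-in for the paper's representative-head compression.
    """
    q_proxy = Q.mean(axis=0)                      # (seq_len, head_dim)
    k_proxy = K.mean(axis=0)
    # Mean-pool keys within each block, then score blocks against queries.
    n_blocks = k_proxy.shape[0] // block_size
    k_blocks = (k_proxy[: n_blocks * block_size]
                .reshape(n_blocks, block_size, -1)
                .mean(axis=1))                    # (n_blocks, head_dim)
    return q_proxy @ k_blocks.T                   # (seq_len, n_blocks)

def dynamic_budget_mask(scores, keep_ratio=0.25):
    """Keep the top fraction of blocks per query row (illustrative budget)."""
    k = max(1, int(scores.shape[1] * keep_ratio))
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask
```

In the full method, the boolean mask would gate which key blocks each head's exact attention actually computes; the per-head dynamic budget described in the abstract would vary `keep_ratio` per head rather than fixing it globally.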
Related papers
- Agentic Test-Time Scaling for WebAgents [65.5178428849495]
We present Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling.
arXiv Detail & Related papers (2026-02-12T18:58:30Z) - LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding [27.856769454125573]
Long-context large language models (LLMs) expose a key bottleneck: the rapidly expanding key-value cache during decoding. We propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism. We demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, the full-attention baseline.
arXiv Detail & Related papers (2026-02-04T13:34:12Z) - Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z) - SpecAttn: Speculating Sparse Attention [1.6921396880325779]
We introduce SpecAttn, a novel training-free approach that seamlessly integrates with speculative decoding techniques. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model. SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset.
arXiv Detail & Related papers (2025-10-31T17:12:34Z) - DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning [6.468843780300177]
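The SpecAttn summary above reuses draft-model attention to pick which KV-cache entries the target model reads. A minimal sketch of that selection step, with an assumed array interface (the paper's actual integration with speculative decoding is more involved):

```python
import numpy as np

def select_kv_tokens(draft_attn, keep_ratio=0.25):
    """Rank KV-cache positions by the attention mass the draft model
    already assigned them, and keep only the top fraction.

    draft_attn: (num_heads, num_draft_queries, seq_len) attention
    weights produced as a byproduct of speculative decoding.
    """
    # Total attention mass each key position received across heads/queries.
    importance = draft_attn.sum(axis=(0, 1))      # (seq_len,)
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(-importance)[:k]
    return np.sort(keep)                          # sorted KV positions to access
```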
We present DELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
arXiv Detail & Related papers (2025-10-10T21:37:49Z) - SAS: Simulated Attention Score [75.1409882298863]
We introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method.
arXiv Detail & Related papers (2025-07-10T12:16:16Z) - Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z) - Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing [4.7924863950812995]
Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference. We propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads. Our method effectively captures actual patterns while requiring full attention for only a small subset of heads.
arXiv Detail & Related papers (2025-05-26T06:48:53Z) - ProgRoCC: A Progressive Approach to Rough Crowd Counting [66.09510514180593]
We introduce Rough Crowd Counting, which delivers better accuracy on the basis of training data that is easier to acquire. We propose an approach to the rough crowd counting problem based on CLIP, termed ProgRoCC. Specifically, we introduce a progressive estimation learning strategy that determines the object count through a coarse-to-fine approach.
arXiv Detail & Related papers (2025-04-18T01:57:42Z) - XAttention: Block Sparse Attention with Antidiagonal Scoring [10.517760961650279]
Long-context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. We introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention.
arXiv Detail & Related papers (2025-03-20T17:59:58Z) - ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference [41.41316718220569]
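XAttention's title names antidiagonal scoring as its block-importance proxy. A minimal sketch of the idea, scoring one block of pre-softmax attention scores by summing a strided subset of its antidiagonals; the `stride` parameter and function shape are assumptions for illustration:

```python
import numpy as np

def antidiagonal_block_score(scores_block, stride=8):
    """Score a (B x B) block of attention scores by summing every
    `stride`-th antidiagonal, a cheap proxy for the block's total mass."""
    B = scores_block.shape[0]
    flipped = np.fliplr(scores_block)             # antidiagonals become diagonals
    total = 0.0
    for offset in range(-B + 1, B, stride):       # sample strided antidiagonals
        total += np.trace(flipped, offset=offset)
    return total
```

Because each sampled antidiagonal crosses every row and column of the block, the strided sum touches the whole block's structure at a fraction of the cost of summing all entries.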
ExpertFlow is designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU.
Our experiments demonstrate that ExpertFlow achieves up to 93.72% GPU memory savings and enhances inference speed by 2 to 10 times compared to baseline methods.
arXiv Detail & Related papers (2024-10-23T15:24:54Z) - Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference [1.0919012968294923]
We introduce a novel algorithm-architecture co-design approach that accelerates transformers using head sparsity, block sparsity and approximation opportunities to reduce computations in attention and reduce memory access.
Observing the huge redundancy in attention scores and attention heads, we propose a novel integer-based row-balanced block pruning scheme to prune unimportant blocks in the attention matrix at run time. We also propose integer-based head pruning to detect and prune unimportant heads at an early stage at run time.
arXiv Detail & Related papers (2024-07-17T11:15:16Z) - DIR-AS: Decoupling Individual Identification and Temporal Reasoning for Action Segmentation [84.78383981697377]
Fully supervised action segmentation works on frame-wise action recognition with dense annotations and often suffers from the over-segmentation issue.
We develop a novel local-global attention mechanism with temporal pyramid dilation and temporal pyramid pooling for efficient multi-scale attention.
We achieve state-of-the-art accuracy, e.g., 82.8% (+2.6%) on GTEA and 74.7% (+1.2%) on Breakfast, which demonstrates the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-04-04T20:27:18Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.