Related papers: A Unified Sparse Attention via Multi-Granularity Compression

A Unified Sparse Attention via Multi-Granularity Compression

URL: http://arxiv.org/abs/2512.14082v1
Date: Tue, 16 Dec 2025 04:42:31 GMT
Title: A Unified Sparse Attention via Multi-Granularity Compression
Authors: Siran Liu, Zane Cao, Yongchao He,
Abstract summary: We present UniSparse, a unified mechanism that introduces the notion of composite tokens--compact representations that aggregate multi-granularity contextual information.<n>Across multiple modalities and tasks, UniSparse consistently surpasses state-of-the-art sparse attention methods in both accuracy and efficiency.
Score: 0.6848057161210613
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens--compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving $\ge$ 99% of full-attention accuracy and up to 2.61$\times$ faster attention computation than FlashAttention.

Related papers

Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference.<n>TCA-Attention achieves a 2.8$times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
Trainable Dynamic Mask Sparse Attention [11.506985057671015]
We introduce a trainable dynamic mask sparse attention mechanism, a method that merges the advantages of both position-aware and content-aware approaches.<n>We demonstrate that the introduced dynamic mask and sparse weights do not obstruct gradients, supporting end-to-end training.
arXiv Detail & Related papers (2025-08-04T07:05:15Z)
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios [44.01959390088695]
Multimodal large language models (MLLMs) process increasingly long and complex contexts.<n> token compression has emerged as an auspicious and critical approach, efficiently reducing the number of tokens during both training and inference.<n>We present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression.
arXiv Detail & Related papers (2025-07-27T09:33:56Z)
Efficient Attention Mechanisms for Large Language Models: A Survey [18.86171225316892]
Transformer-based architectures have become the prevailing computation backbone of large language models.<n>Recent research has introduced two principal categories of efficient attention mechanisms.<n>Sparse attention techniques, in contrast, limit attention to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies.
arXiv Detail & Related papers (2025-07-25T18:08:10Z)
Modality Agnostic Efficient Long Range Encoder [14.705955027331674]
We address the challenge of long-context processing on a single device using generic implementations.<n>To overcome these limitations, we propose MAELRE, a unified and efficient transformer architecture.<n>We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models.
arXiv Detail & Related papers (2025-07-25T16:19:47Z)
ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses.<n>They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications.<n>We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z)
FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation [1.4525238046020867]
Open-vocabulary semantic segmentation aims to segment objects from arbitrary text categories without requiring densely annotated datasets.<n>We present FA-Seg, a training-free framework for open-vocabulary segmentation based on diffusion models.
arXiv Detail & Related papers (2025-06-29T16:41:41Z)
PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts.<n>We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension.<n>Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5sim 40%$.
arXiv Detail & Related papers (2025-03-05T15:24:11Z)
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [68.8373788348678]
Visual instruction tuning adapts pre-trained Multimodal Large Language Models to follow human instructions.<n>PRISM is the first training-free framework for efficient visual instruction selection.<n>It reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines.
arXiv Detail & Related papers (2025-02-17T18:43:41Z)
Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling.<n>Our method automatically focuses and strengthens core context while diminishing redundancy during the learning process.<n>Our method is able to replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.