Prism: Spectral-Aware Block-Sparse Attention
- URL: http://arxiv.org/abs/2602.08426v1
- Date: Mon, 09 Feb 2026 09:31:06 GMT
- Title: Prism: Spectral-Aware Block-Sparse Attention
- Authors: Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu
- Abstract summary: Existing methods typically employ coarse-grained attention as a proxy for block importance estimation. Mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions. We introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches.
- Score: 46.31167787304103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
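The pooling-as-low-pass-filter claim can be sanity-checked directly: averaging the RoPE rotation e^{i*theta*p} over the positions of a block gives a gain near 1 on slowly rotating (low-frequency) dimensions and near 0 on fast ones, which is the "blind spot" described above. The snippet below is a minimal sketch, not the authors' code; the head dimension, block size, and RoPE base are illustrative defaults.

```python
# Minimal sketch (not the authors' code): mean pooling over a block multiplies each
# RoPE dimension by |mean_p exp(i*theta*p)|, so fast-rotating dimensions cancel out.
# Head dimension, block size, and RoPE base are illustrative assumptions.
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i of standard RoPE (fast -> slow)."""
    half = dim // 2
    return base ** (-np.arange(half) / half)

def mean_pool_gain(theta, block):
    """Factor by which mean pooling over `block` positions scales a dimension
    rotating at frequency theta: |(1/B) * sum_p exp(i*theta*p)|."""
    p = np.arange(block)
    return np.abs(np.exp(1j * theta * p).mean())

dim, block = 128, 64
thetas = rope_frequencies(dim)
gains = np.array([mean_pool_gain(t, block) for t in thetas])
print("gain on fastest (high-frequency) dims:", gains[:4].round(3))   # ~0: the blind spot
print("gain on slowest (low-frequency) dims :", gains[-4:].round(3))  # ~1: preserved
```

Prism's high-frequency branch and energy-based temperature calibration are aimed at recovering precisely the signal that this near-zero gain erases from pooled block representations.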
Related papers
- MirrorLA: Reflecting Feature Map for Vision Linear Attention [49.41670925034762]
Linear attention significantly reduces the computational complexity of Transformers from quadratic to linear, yet it consistently lags behind softmax-based attention in performance. We propose MirrorLA, a geometric framework that substitutes passive truncation with active reorientation. MirrorLA achieves state-of-the-art performance across standard benchmarks, demonstrating that strictly linear efficiency can be achieved without compromising representational fidelity.
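The complexity reduction this entry refers to comes from replacing softmax(QK^T)V with a kernelized form phi(Q)(phi(K)^T V) that never materializes the N x N attention matrix. The sketch below shows only that generic linear-attention form, with the common phi(x) = elu(x) + 1 feature map as an assumed stand-in; MirrorLA's reflecting/reorienting feature map is not described in the summary and is not reproduced here.

```python
# Generic (non-MirrorLA) linear attention in O(N * d^2); phi(x) = elu(x) + 1 is an
# assumed feature map, the step MirrorLA's "reflecting feature map" would replace.
import numpy as np

def phi(x):
    """Positive feature map elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """q, k: (N, d); v: (N, d_v). Computes phi(q) @ (phi(k)^T v), normalized per query,
    without ever forming the N x N attention matrix."""
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                    # (d, d_v): one pass over all keys/values
    z = qf @ kf.sum(axis=0)          # (N,): per-query normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 2048, 64
q, k, v = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
print(linear_attention(q, k, v).shape)   # (2048, 64)
```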
arXiv Detail & Related papers (2026-02-04T09:14:09Z) - PWAVEP: Purifying Imperceptible Adversarial Perturbations in 3D Point Clouds via Spectral Graph Wavelets [8.051098153943704]
Adversarial attacks on 3D point clouds present significant challenges for defenders. We propose a plug-and-play and non-invasive defense mechanism in the spectral domain. We show that the proposed PWAVEP achieves superior accuracy and robustness compared to existing approaches.
arXiv Detail & Related papers (2026-02-03T10:00:04Z) - PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers [37.401543107035046]
Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. We propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity.
arXiv Detail & Related papers (2026-02-01T07:47:06Z) - Amortized Spectral Kernel Discovery via Prior-Data Fitted Network [0.0]
We introduce an interpretability-driven framework for amortized spectral discovery from pre-trained PFNs with decoupled attention. We propose decoder architectures that map PFN latents to explicit spectral density estimates and corresponding stationary kernels. This yields orders-of-magnitude reductions in inference time compared to optimization-based baselines.
arXiv Detail & Related papers (2026-01-29T13:51:26Z) - DoPE: Denoising Rotary Position Embedding [60.779039511252584]
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Rotary Position Embedding (DoPE). DoPE is a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map.
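The summary names a concrete mechanism, a truncated matrix entropy computed over frequency bands of the attention map, without spelling it out. The sketch below is a guess at the general shape of that idea, not DoPE's algorithm: "matrix entropy" is taken as the Shannon entropy of the top-k normalized singular values, the band split is a simple contiguous partition of head dimensions, and the outlier rule is an arbitrary 2-sigma test.

```python
# Hedged guess at the general shape of the idea (not DoPE's algorithm): compute an
# attention map per frequency band of the head dimension and flag bands whose
# truncated matrix entropy (entropy of the top-k normalized singular values) is an
# outlier.  Band split, k, and the 2-sigma outlier rule are assumptions.
import numpy as np

def truncated_matrix_entropy(a, k=16):
    s = np.linalg.svd(a, compute_uv=False)[:k]
    p = s / s.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def band_entropies(q, k_, n_bands=4):
    """q, k_: (N, d) already position-encoded; dims split into contiguous bands."""
    ents = []
    for qb, kb in zip(np.array_split(q, n_bands, axis=1),
                      np.array_split(k_, n_bands, axis=1)):
        logits = qb @ kb.T / np.sqrt(qb.shape[1])
        attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        ents.append(truncated_matrix_entropy(attn))
    return np.array(ents)

rng = np.random.default_rng(0)
q, k_ = rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
ents = band_entropies(q, k_)
print(ents)                                                        # entropy per band
print("outliers:", np.where(np.abs(ents - ents.mean()) > 2 * ents.std())[0])
```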
arXiv Detail & Related papers (2025-11-12T09:32:35Z) - ProxyAttn: Guided Sparse Attention via Representative Heads [59.03412871683236]
We propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation. We show that ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.
arXiv Detail & Related papers (2025-09-29T13:10:39Z) - FlashBias: Fast Computation of Attention with Bias [70.44379606190569]
Attention with bias has been widely deployed in vision, language, protein-folding and other advanced scientific models. The bias term disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention. This paper presents FlashBias, based on low-rank compressed sensing theory, which provides fast exact computation for many widely used attention biases.
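The reason a low-rank bias is compatible with fused kernels is a short identity: if B is approximately UV^T with rank r, then QK^T + UV^T = [Q U][K V]^T, so the bias folds into augmented queries and keys and the N x N bias never has to be materialized. The check below illustrates only this folding identity; how FlashBias actually obtains the low-rank factors (its compressed-sensing construction) is not shown, and the sizes are arbitrary.

```python
# Why a low-rank bias is cheap (an identity, not FlashBias itself): with B ~= U @ V.T,
#   Q @ K.T + U @ V.T == hstack([Q, U]) @ hstack([K, V]).T,
# so the bias folds into augmented queries/keys and a fused attention kernel never
# materializes the N x N bias.  Rank r and the random factors are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 512, 64, 8
Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))

scores_naive = Q @ K.T + U @ V.T                          # materializes the bias
scores_folded = np.hstack([Q, U]) @ np.hstack([K, V]).T   # bias folded into Q/K
print(np.allclose(scores_naive, scores_folded))           # True
```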
arXiv Detail & Related papers (2025-05-17T15:12:50Z) - XAttention: Block Sparse Attention with Antidiagonal Scoring [10.517760961650279]
Long-context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. We introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention.
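The title names the mechanism, scoring blocks by their antidiagonals, and the sketch below illustrates one plausible reading: rank each (query-block, key-block) pair by the sum over that block's antidiagonal entries and keep the top-k key blocks per query block. Block size, the top-k rule, and forming the full logits for clarity are assumptions rather than XAttention's exact procedure.

```python
# Hedged illustration of antidiagonal block scoring (the mechanism named in the title):
# rank each (query-block, key-block) pair by the sum over that block's antidiagonal and
# keep the top-k key blocks per query block.  Forming the full logits here is only for
# clarity; a real kernel would compute the sampled antidiagonal entries directly.
import numpy as np

def antidiagonal_block_scores(logits, block):
    """logits: (N, N) attention scores; returns (N//block, N//block) block scores."""
    nb = logits.shape[0] // block
    rows, cols = np.arange(block)[::-1], np.arange(block)   # antidiagonal indices
    out = np.empty((nb, nb))
    for a in range(nb):
        for b in range(nb):
            blk = logits[a * block:(a + 1) * block, b * block:(b + 1) * block]
            out[a, b] = blk[rows, cols].sum()
    return out

rng = np.random.default_rng(0)
n, block, keep = 256, 32, 3
logits = rng.standard_normal((n, n))
block_scores = antidiagonal_block_scores(logits, block)
selected = np.argsort(block_scores, axis=1)[:, -keep:]   # key blocks kept per query block
print(selected.shape)                                     # (8, 3)
```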
arXiv Detail & Related papers (2025-03-20T17:59:58Z) - SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [10.702409298302547]
SeerAttention learns the block-level attention sparsity from the Large Language Model itself. Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate. Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling.
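A minimal sketch of the kind of block-level gate the summary describes: pooled query and key blocks are projected by small gate weights (learned in the paper; random stand-ins here) and passed through a sigmoid to decide which blocks the sparse attention keeps. The pooling, gate width, and threshold are assumptions for illustration.

```python
# Hedged sketch of a block-level gate in the spirit of the summary: pooled query/key
# blocks are projected by small gate weights (learned in the paper; random here) and
# passed through a sigmoid to pick which blocks sparse attention keeps.
# Pooling, gate width, and the 0.5 threshold are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, block, d_gate = 1024, 64, 64, 16
q, k = rng.standard_normal((n, d)), rng.standard_normal((n, d))

q_pool = q.reshape(-1, block, d).mean(axis=1)           # (n_blocks, d)
k_pool = k.reshape(-1, block, d).mean(axis=1)
w_q, w_k = rng.standard_normal((d, d_gate)), rng.standard_normal((d, d_gate))

gate_logits = (q_pool @ w_q) @ (k_pool @ w_k).T / np.sqrt(d_gate)
gate = 1.0 / (1.0 + np.exp(-gate_logits))                # sigmoid over block pairs
block_mask = gate > 0.5                                  # blocks kept for sparse attention
print(block_mask.shape, round(block_mask.mean(), 2))     # (16, 16), fraction kept
```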
arXiv Detail & Related papers (2024-10-17T07:07:09Z) - Adaptive Low-Pass Filtering using Sliding Window Gaussian Processes [71.23286211775084]
We propose an adaptive low-pass filter based on Gaussian process regression.
We show that the estimation error of the proposed method is uniformly bounded.
arXiv Detail & Related papers (2021-11-05T17:06:59Z)
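For the last entry above, the filtering step itself is standard GP regression applied over a sliding window: the filtered value at each time is the posterior mean given the most recent noisy samples. The sketch below assumes a squared-exponential kernel with fixed hyperparameters and a fixed window; the paper's adaptive tuning and its uniform error bound are not reproduced.

```python
# Hedged sketch: low-pass filtering as sliding-window GP regression, where the filtered
# value at each step is the GP posterior mean given the last `window` noisy samples.
# Kernel choice, hyperparameters, noise level, and window size are assumptions.
import numpy as np

def se_kernel(a, b, length=2.0, var=1.0):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_lowpass(t, y, window=20, noise=0.1):
    out = np.empty_like(y)
    for i in range(len(y)):
        lo = max(0, i - window + 1)
        tw, yw = t[lo:i + 1], y[lo:i + 1]
        K = se_kernel(tw, tw) + noise ** 2 * np.eye(len(tw))
        k_star = se_kernel(np.array([t[i]]), tw)          # covariance to the newest point
        out[i] = (k_star @ np.linalg.solve(K, yw))[0]     # posterior mean
    return out

t = np.linspace(0.0, 10.0, 200)
y = np.sin(t) + 0.3 * np.random.default_rng(0).standard_normal(t.shape)
print(np.round(gp_lowpass(t, y)[:5], 3))                  # smoothed signal values
```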