Prism: Spectral-Aware Block-Sparse Attention
- URL: http://arxiv.org/abs/2602.08426v1
- Date: Mon, 09 Feb 2026 09:31:06 GMT
- Title: Prism: Spectral-Aware Block-Sparse Attention
- Authors: Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu
- Abstract summary: Existing methods typically employ coarse-grained attention as a proxy for block importance estimation. Mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions. We introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches.
- Score: 46.31167787304103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
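The pooling-as-low-pass-filter claim can be sanity-checked directly: averaging the RoPE rotation e^{i*theta*p} over the positions of a block gives a gain near 1 on slowly rotating (low-frequency) dimensions and near 0 on fast ones, which is the "blind spot" described above. The snippet below is a minimal sketch, not the authors' code; the head dimension, block size, and RoPE base are illustrative defaults.

```python
# Minimal sketch (not the authors' code): mean pooling over a block multiplies each
# RoPE dimension by |mean_p exp(i*theta*p)|, so fast-rotating dimensions cancel out.
# Head dimension, block size, and RoPE base are illustrative assumptions.
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i of standard RoPE (fast -> slow)."""
    half = dim // 2
    return base ** (-np.arange(half) / half)

def mean_pool_gain(theta, block):
    """Factor by which mean pooling over `block` positions scales a dimension
    rotating at frequency theta: |(1/B) * sum_p exp(i*theta*p)|."""
    p = np.arange(block)
    return np.abs(np.exp(1j * theta * p).mean())

dim, block = 128, 64
thetas = rope_frequencies(dim)
gains = np.array([mean_pool_gain(t, block) for t in thetas])
print("gain on fastest (high-frequency) dims:", gains[:4].round(3))   # ~0: the blind spot
print("gain on slowest (low-frequency) dims :", gains[-4:].round(3))  # ~1: preserved
```

Prism's high-frequency branch and energy-based temperature calibration are aimed at recovering precisely the signal that this near-zero gain erases from pooled block representations.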
Related papers
- MirrorLA: Reflecting Feature Map for Vision Linear Attention [49.41670925034762]
Linear attention significantly reduces the computational complexity of Transformers from quadratic to linear, yet it consistently lags behind softmax-based attention in performance. We propose MirrorLA, a geometric framework that substitutes passive truncation with active reorientation. MirrorLA achieves state-of-the-art performance across standard benchmarks, demonstrating that strictly linear efficiency can be achieved without compromising representational fidelity.
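The complexity reduction this entry refers to comes from replacing softmax(QK^T)V with a kernelized form phi(Q)(phi(K)^T V) that never materializes the N x N attention matrix. The sketch below shows only that generic linear-attention form, with the common phi(x) = elu(x) + 1 feature map as an assumed stand-in; MirrorLA's reflecting/reorienting feature map is not described in the summary and is not reproduced here.

```python
# Generic (non-MirrorLA) linear attention in O(N * d^2); phi(x) = elu(x) + 1 is an
# assumed feature map, the step MirrorLA's "reflecting feature map" would replace.
import numpy as np

def phi(x):
    """Positive feature map elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """q, k: (N, d); v: (N, d_v). Computes phi(q) @ (phi(k)^T v), normalized per query,
    without ever forming the N x N attention matrix."""
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                    # (d, d_v): one pass over all keys/values
    z = qf @ kf.sum(axis=0)          # (N,): per-query normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 2048, 64
q, k, v = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
print(linear_attention(q, k, v).shape)   # (2048, 64)
```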
arXiv Detail & Related papers (2026-02-04T09:14:09Z) - PWAVEP: Purifying Imperceptible Adversarial Perturbations in 3D Point Clouds via Spectral Graph Wavelets [8.051098153943704]
Adversarial attacks on 3D point clouds present significant challenges for defenders. We propose a plug-and-play and non-invasive defense mechanism in the spectral domain. We show that the proposed PWAVEP achieves superior accuracy and robustness compared to existing approaches.
arXiv Detail & Related papers (2026-02-03T10:00:04Z) - PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers [37.401543107035046]
Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. We propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity.
arXiv Detail & Related papers (2026-02-01T07:47:06Z) - Amortized Spectral Kernel Discovery via Prior-Data Fitted Network [0.0]
We introduce an interpretability-driven framework for amortized spectral discovery from pre-trained PFNs with decoupled attention. We propose decoder architectures that map PFN latents to explicit spectral density estimates and corresponding stationary kernels. This yields orders-of-magnitude reductions in inference time compared to optimization-based baselines.
arXiv Detail & Related papers (2026-01-29T13:51:26Z) - DoPE: Denoising Rotary Position Embedding [60.779039511252584]
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Rotary Position Embedding (DoPE). DoPE is a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map.
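The summary names a concrete mechanism, a truncated matrix entropy computed over frequency bands of the attention map, without spelling it out. The sketch below is a guess at the general shape of that idea, not DoPE's algorithm: "matrix entropy" is taken as the Shannon entropy of the top-k normalized singular values, the band split is a simple contiguous partition of head dimensions, and the outlier rule is an arbitrary 2-sigma test.

```python
# Hedged guess at the general shape of the idea (not DoPE's algorithm): compute an
# attention map per frequency band of the head dimension and flag bands whose
# truncated matrix entropy (entropy of the top-k normalized singular values) is an
# outlier.  Band split, k, and the 2-sigma outlier rule are assumptions.
import numpy as np

def truncated_matrix_entropy(a, k=16):
    s = np.linalg.svd(a, compute_uv=False)[:k]
    p = s / s.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def band_entropies(q, k_, n_bands=4):
    """q, k_: (N, d) already position-encoded; dims split into contiguous bands."""
    ents = []
    for qb, kb in zip(np.array_split(q, n_bands, axis=1),
                      np.array_split(k_, n_bands, axis=1)):
        logits = qb @ kb.T / np.sqrt(qb.shape[1])
        attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        ents.append(truncated_matrix_entropy(attn))
    return np.array(ents)

rng = np.random.default_rng(0)
q, k_ = rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
ents = band_entropies(q, k_)
print(ents)                                                        # entropy per band
print("outliers:", np.where(np.abs(ents - ents.mean()) > 2 * ents.std())[0])
```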
arXiv Detail & Related papers (2025-11-12T09:32:35Z) - ProxyAttn: Guided Sparse Attention via Representative Heads [59.03412871683236]
We propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation. We show that ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.
arXiv Detail & Related papers (2025-09-29T13:10:39Z) - FlashBias: Fast Computation of Attention with Bias [70.44379606190569]
Attention with bias has been widely deployed in vision, language, protein-folding and other advanced scientific models. The bias term disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention. This paper presents FlashBias, based on low-rank compressed sensing theory, which provides fast exact computation for many widely used attention biases.
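The reason a low-rank bias is compatible with fused kernels is a short identity: if B is approximately UV^T with rank r, then QK^T + UV^T = [Q U][K V]^T, so the bias folds into augmented queries and keys and the N x N bias never has to be materialized. The check below illustrates only this folding identity; how FlashBias actually obtains the low-rank factors (its compressed-sensing construction) is not shown, and the sizes are arbitrary.

```python
# Why a low-rank bias is cheap (an identity, not FlashBias itself): with B ~= U @ V.T,
#   Q @ K.T + U @ V.T == hstack([Q, U]) @ hstack([K, V]).T,
# so the bias folds into augmented queries/keys and a fused attention kernel never
# materializes the N x N bias.  Rank r and the random factors are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 512, 64, 8
Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))

scores_naive = Q @ K.T + U @ V.T                          # materializes the bias
scores_folded = np.hstack([Q, U]) @ np.hstack([K, V]).T   # bias folded into Q/K
print(np.allclose(scores_naive, scores_folded))           # True
```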
arXiv Detail & Related papers (2025-05-17T15:12:50Z) - XAttention: Block Sparse Attention with Antidiagonal Scoring [10.517760961650279]
Long-context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. We introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention.
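The title names the mechanism, scoring blocks by their antidiagonals, and the sketch below illustrates one plausible reading: rank each (query-block, key-block) pair by the sum over that block's antidiagonal entries and keep the top-k key blocks per query block. Block size, the top-k rule, and forming the full logits for clarity are assumptions rather than XAttention's exact procedure.

```python
# Hedged illustration of antidiagonal block scoring (the mechanism named in the title):
# rank each (query-block, key-block) pair by the sum over that block's antidiagonal and
# keep the top-k key blocks per query block.  Forming the full logits here is only for
# clarity; a real kernel would compute the sampled antidiagonal entries directly.
import numpy as np

def antidiagonal_block_scores(logits, block):
    """logits: (N, N) attention scores; returns (N//block, N//block) block scores."""
    nb = logits.shape[0] // block
    rows, cols = np.arange(block)[::-1], np.arange(block)   # antidiagonal indices
    out = np.empty((nb, nb))
    for a in range(nb):
        for b in range(nb):
            blk = logits[a * block:(a + 1) * block, b * block:(b + 1) * block]
            out[a, b] = blk[rows, cols].sum()
    return out

rng = np.random.default_rng(0)
n, block, keep = 256, 32, 3
logits = rng.standard_normal((n, n))
block_scores = antidiagonal_block_scores(logits, block)
selected = np.argsort(block_scores, axis=1)[:, -keep:]   # key blocks kept per query block
print(selected.shape)                                     # (8, 3)
```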
arXiv Detail & Related papers (2025-03-20T17:59:58Z) - SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [10.702409298302547]
SeerAttention learns the block-level attention sparsity from the Large Language Model itself. Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate. Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling.
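A minimal sketch of the kind of block-level gate the summary describes: pooled query and key blocks are projected by small gate weights (learned in the paper; random stand-ins here) and passed through a sigmoid to decide which blocks the sparse attention keeps. The pooling, gate width, and threshold are assumptions for illustration.

```python
# Hedged sketch of a block-level gate in the spirit of the summary: pooled query/key
# blocks are projected by small gate weights (learned in the paper; random here) and
# passed through a sigmoid to pick which blocks sparse attention keeps.
# Pooling, gate width, and the 0.5 threshold are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, block, d_gate = 1024, 64, 64, 16
q, k = rng.standard_normal((n, d)), rng.standard_normal((n, d))

q_pool = q.reshape(-1, block, d).mean(axis=1)           # (n_blocks, d)
k_pool = k.reshape(-1, block, d).mean(axis=1)
w_q, w_k = rng.standard_normal((d, d_gate)), rng.standard_normal((d, d_gate))

gate_logits = (q_pool @ w_q) @ (k_pool @ w_k).T / np.sqrt(d_gate)
gate = 1.0 / (1.0 + np.exp(-gate_logits))                # sigmoid over block pairs
block_mask = gate > 0.5                                  # blocks kept for sparse attention
print(block_mask.shape, round(block_mask.mean(), 2))     # (16, 16), fraction kept
```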
arXiv Detail & Related papers (2024-10-17T07:07:09Z) - Adaptive Low-Pass Filtering using Sliding Window Gaussian Processes [71.23286211775084]
We propose an adaptive low-pass filter based on Gaussian process regression.
We show that the estimation error of the proposed method is uniformly bounded.
arXiv Detail & Related papers (2021-11-05T17:06:59Z)
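For the last entry above, the filtering step itself is standard GP regression applied over a sliding window: the filtered value at each time is the posterior mean given the most recent noisy samples. The sketch below assumes a squared-exponential kernel with fixed hyperparameters and a fixed window; the paper's adaptive tuning and its uniform error bound are not reproduced.

```python
# Hedged sketch: low-pass filtering as sliding-window GP regression, where the filtered
# value at each step is the GP posterior mean given the last `window` noisy samples.
# Kernel choice, hyperparameters, noise level, and window size are assumptions.
import numpy as np

def se_kernel(a, b, length=2.0, var=1.0):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_lowpass(t, y, window=20, noise=0.1):
    out = np.empty_like(y)
    for i in range(len(y)):
        lo = max(0, i - window + 1)
        tw, yw = t[lo:i + 1], y[lo:i + 1]
        K = se_kernel(tw, tw) + noise ** 2 * np.eye(len(tw))
        k_star = se_kernel(np.array([t[i]]), tw)          # covariance to the newest point
        out[i] = (k_star @ np.linalg.solve(K, yw))[0]     # posterior mean
    return out

t = np.linspace(0.0, 10.0, 200)
y = np.sin(t) + 0.3 * np.random.default_rng(0).standard_normal(t.shape)
print(np.round(gp_lowpass(t, y)[:5], 3))                  # smoothed signal values
```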