Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation
- URL: http://arxiv.org/abs/2511.19835v1
- Date: Tue, 25 Nov 2025 02:03:54 GMT
- Title: Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation
- Authors: Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu
- Abstract summary: Diffusion Transformers dominate video generation, but the quadratic complexity of attention introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods induce systematic biases in attention allocation. We propose Rectified SpaAttn, which rectifies attention allocation with an implicit full-attention reference.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at https://github.com/BienLuky/Rectified-SpaAttn .
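The abstract's two rectifications can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not the paper's Triton kernel: the function name, parameters, and the exact rectification scheme below are illustrative assumptions. The idea it demonstrates is the one the abstract describes: critical tokens keep exact attention scores, non-critical tokens contribute through pooled query-key/value averages rather than being dropped, and a shared softmax normalizer over both groups keeps the critical weights from being amplified the way a renormalized sparse softmax would.

```python
import numpy as np

def rectified_sparse_attention(q, k, v, top_k=4, pool_size=4):
    """Toy sketch of rectified sparse attention (illustrative, per-query loop;
    a real kernel would work block-wise and never form the full score row)."""
    nq, d = q.shape
    nk = k.shape[0]
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((nq, d))
    for i in range(nq):
        scores = (q[i] @ k.T) * scale          # exact scores for this query
        crit = np.argsort(scores)[-top_k:]     # "critical" key indices
        noncrit = np.setdiff1d(np.arange(nk), crit)
        if noncrit.size == 0:                  # degenerate case: full attention
            w = np.exp(scores - scores.max())
            out[i] = (w / w.sum()) @ v
            continue
        # Pool non-critical keys/values into groups to approximate their mass
        # instead of discarding them entirely.
        groups = [noncrit[j:j + pool_size] for j in range(0, noncrit.size, pool_size)]
        pooled_k = np.stack([k[g].mean(axis=0) for g in groups])
        pooled_v = np.stack([v[g].mean(axis=0) for g in groups])
        sizes = np.array([len(g) for g in groups], dtype=float)
        pooled_s = (q[i] @ pooled_k.T) * scale
        # Shared normalizer: exact critical terms + size-weighted pooled terms.
        # Because z includes the pooled mass, critical weights are rectified
        # downward relative to a softmax taken over the critical set alone.
        m = scores[crit].max()
        e_crit = np.exp(scores[crit] - m)
        e_pool = sizes * np.exp(pooled_s - m)
        z = e_crit.sum() + e_pool.sum()
        out[i] = (e_crit @ v[crit] + e_pool @ pooled_v) / z
    return out
```

When `top_k` covers all keys, the sketch reduces to exact softmax attention; with fewer critical keys, the pooled terms supply the "attention gains" from non-critical tokens at the cost of a pooling error, which is the trade-off the paper's Gain-Aware Pooling Rectification is designed to control.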
Related papers
- A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training [86.64715217940274]
Outliers function jointly with normalization. Outliers serve more as rescale factors than as contributors. They can be absorbed into learnable parameters or mitigated via explicit gated rescaling.
arXiv Detail & Related papers (2026-01-30T13:29:45Z) - Attention Needs to Focus: A Unified Perspective on Attention Allocation [37.34801068995858]
The Transformer architecture is a cornerstone of modern Large Language Models (LLMs). The standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. We present a unified perspective, arguing that both can be traced to a common root: improper attention allocation.
arXiv Detail & Related papers (2026-01-01T08:39:15Z) - Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction [12.740812798007573]
Linear attention's finite memory induces forgetfulness that harms retrieval-intensive tasks. We explore a series of hybrid models that restore direct access to past tokens. We propose a novel learnable token-eviction approach.
arXiv Detail & Related papers (2025-10-23T17:53:03Z) - QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification [67.15451442018258]
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. We propose QuantSparse, a unified framework that integrates model quantization with attention sparsification.
arXiv Detail & Related papers (2025-09-28T06:49:44Z) - FlashBias: Fast Computation of Attention with Bias [70.44379606190569]
Attention with bias has been widely deployed in vision, language, protein-folding, and other advanced scientific models. However, it disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention. This paper presents FlashBias, based on low-rank compressed sensing theory, which provides fast, exact computation for many widely used attention biases.
arXiv Detail & Related papers (2025-05-17T15:12:50Z) - Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
Transformers have quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting this distributional shift. Our method maintains approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M-token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - TRA: Better Length Generalisation with Threshold Relative Attention [58.64717643300818]
Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. We show how the attention mechanism with these two mitigations in place can substantially improve the generalisation capabilities of decoder-only transformers.
arXiv Detail & Related papers (2025-03-29T18:06:28Z) - AttentionPredictor: Temporal Patterns Matter for KV Cache Compression [64.75459635661562]
We propose AttentionPredictor, the first learning-based method to directly predict attention patterns for KV cache compression and critical-token identification. AttentionPredictor accurately predicts attention scores and shares a unified prediction model that consumes negligible memory. By retaining most of the attention information, AttentionPredictor achieves 13× KV cache compression and a 5.6× speedup in a cache-offloading scenario.
arXiv Detail & Related papers (2025-02-06T13:41:46Z) - Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study [38.492552119793]
We investigate an alternative attention mechanism based on the stick-breaking process in larger-scale settings. We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. When used as a drop-in replacement for current softmax+RoPE attention systems, stick-breaking attention performs competitively with current methods.
arXiv Detail & Related papers (2024-10-23T15:51:13Z) - When Attention Sink Emerges in Language Models: An Empirical View [39.36282162213973]
Language Models (LMs) assign significant attention to the first token, even when it is not semantically important. This phenomenon has been widely exploited in applications such as streaming/long-context generation, KV cache optimization, inference acceleration, and model quantization. We first demonstrate that attention sinks exist universally in LMs across various inputs, even in small models.
arXiv Detail & Related papers (2024-10-14T17:50:28Z) - Robustifying Token Attention for Vision Transformers [72.07710236246285]
Vision transformers (ViTs) still suffer from significant drops in accuracy in the presence of common corruptions.
We propose two general techniques to make attention more stable.
First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism.
Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
arXiv Detail & Related papers (2023-03-20T14:04:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.