Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
- URL: http://arxiv.org/abs/2505.22918v4
- Date: Tue, 28 Oct 2025 21:55:57 GMT
- Title: Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
- Authors: Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu,
- Abstract summary: A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. Existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. We propose Re-ttention, which implements very high sparse attention for visual generation models.
- Score: 38.76559841681518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.
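To make the normalization-shift idea concrete, here is a minimal PyTorch sketch of sparse attention whose softmax denominator is blended with a statistic cached from an earlier denoising step; the kept-token subset, the cached denominator, and the simple averaging rule are illustrative assumptions rather than Re-ttention's actual reshaping rule.

```python
import torch

def sparse_attention_stat_reshape(q, k, v, keep_idx, prev_denom=None, mix=0.5):
    """Sparse attention whose softmax denominator is blended with a statistic
    cached from an earlier denoising step (illustrative sketch only).

    q, k, v:    [batch, heads, seq, dim]
    keep_idx:   1-D LongTensor of kept key/value token indices
    prev_denom: [batch, heads, seq, 1] denominator cached at a previous step
    """
    scale = q.shape[-1] ** -0.5
    k_s, v_s = k[:, :, keep_idx, :], v[:, :, keep_idx, :]
    logits = q @ k_s.transpose(-1, -2) * scale            # [b, h, seq, kept]
    # Skip the usual max-subtraction trick so denominators from different
    # steps remain directly comparable (acceptable for a small sketch).
    exp_logits = torch.exp(logits)
    denom_sparse = exp_logits.sum(dim=-1, keepdim=True)
    # Hypothetical "reshape": mix the sparse denominator with the cached one
    # so the few kept scores are not inflated by pure re-normalization.
    denom = denom_sparse if prev_denom is None else mix * denom_sparse + (1.0 - mix) * prev_denom
    out = (exp_logits / denom) @ v_s                       # [b, h, seq, dim]
    return out, denom_sparse.detach()                      # statistic to cache
```

In a denoising loop, each step would pass only a small fraction of key/value indices via keep_idx (the paper reports as few as 3.1% of tokens) and feed the returned statistic back in at the next step.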
Related papers
- TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space [15.535854202219072]
We take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiTs are primarily due to the dominant spatial and temporal correlations at the token channel level. We propose a lightweight and adaptive strategy that approximates attention computations by reusing partial attention scores across spatially or temporally correlated tokens along individual channels.
arXiv Detail & Related papers (2025-11-15T05:07:31Z) - Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation [21.87891961960399]
Compact Attention is a hardware-aware acceleration framework featuring three innovations. Our method achieves 1.6-2.5x acceleration of attention computation on single-GPU setups. This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation.
arXiv Detail & Related papers (2025-08-18T14:45:42Z) - S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation [55.35880044416441]
We propose S$^2$Q-VDiT, a post-training quantization framework for video diffusion models (V-DMs). Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration.
arXiv Detail & Related papers (2025-08-06T02:12:29Z) - PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models [14.14413223631804]
In visual generation, the quadratic complexity of attention mechanisms results in high memory and computational costs. We propose an alternative strategy: *reorganizing* the attention pattern to alleviate the challenges. Inspired by the local aggregation nature of visual feature extraction, we design a novel **Pattern-Aware token ReOrdering (PARO)** technique.
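As a rough illustration of the reordering idea (not the PARO algorithm itself), the sketch below permutes tokens with a given ordering so that related tokens fall into the same block, attends only within those blocks, and then restores the original order; the source of the permutation and the block-diagonal pattern are assumptions made for brevity.

```python
import torch

def reorder_block_attention(q, k, v, order, block=64):
    """Toy block-diagonal attention after permuting tokens so related tokens
    share a block. `order` is assumed to be given (e.g. from a clustering or
    locality heuristic); it is not the PARO reordering.

    q, k, v: [batch, heads, seq, dim]; order: [seq] permutation indices.
    """
    inv = torch.argsort(order)                            # to undo the permutation
    qp, kp, vp = (t[:, :, order, :] for t in (q, k, v))
    b, h, s, d = qp.shape
    assert s % block == 0, "pad the sequence to a multiple of the block size"
    # attend only inside each contiguous block of the reordered sequence
    qb = qp.view(b, h, s // block, block, d)
    kb = kp.view(b, h, s // block, block, d)
    vb = vp.view(b, h, s // block, block, d)
    attn = torch.softmax(qb @ kb.transpose(-1, -2) * d ** -0.5, dim=-1)
    out = (attn @ vb).reshape(b, h, s, d)
    return out[:, :, inv, :]                              # restore original token order
```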
arXiv Detail & Related papers (2025-06-19T06:25:02Z) - Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [70.4360995984905]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the long-standing issue of exposure bias, where models trained on ground-truth context must, at inference time, generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z) - Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation [1.3207844222875191]
Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. Static caching mitigates this cost by reusing features across fixed steps but fails to adapt to generation dynamics. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance.
arXiv Detail & Related papers (2025-05-31T00:52:17Z) - RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy [7.196471805257555]
RainFusion exploits the inherent sparsity of visual data to accelerate attention computation while preserving video quality. Our proposed RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models.
arXiv Detail & Related papers (2025-05-27T11:15:02Z) - FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge [60.000984252907195]
Auto-regressive (AR) models have recently shown promise in visual generation tasks due to their superior sampling efficiency. Video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. We propose the FastCar framework to accelerate the decode phase of AR video generation by exploiting temporal redundancy.
arXiv Detail & Related papers (2025-05-17T05:00:39Z) - DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance [43.423240627266644]
Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency. We propose DraftAttention, a training-free framework for the acceleration of video diffusion transformers with dynamic sparse attention on GPUs.
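The two-stage idea can be sketched as follows: a pooled, low-resolution "draft" attention map picks which key blocks each query block should attend to, and full-resolution attention is then computed only over those blocks. The pooling size, scoring rule, and keep ratio below are illustrative assumptions, not DraftAttention's implementation.

```python
import torch

def draft_guided_sparse_attention(q, k, v, pool=8, keep_ratio=0.1):
    """Toy low-resolution-guided sparse attention.

    q, k, v: [batch, heads, seq, dim] with seq divisible by `pool`.
    """
    b, h, s, d = q.shape
    nb = s // pool
    # 1) draft stage: average-pool queries/keys and score block-to-block relevance
    q_low = q.view(b, h, nb, pool, d).mean(dim=3)
    k_low = k.view(b, h, nb, pool, d).mean(dim=3)
    draft = q_low @ k_low.transpose(-1, -2) * d ** -0.5        # [b, h, nb, nb]
    keep = max(1, int(keep_ratio * nb))
    top = draft.topk(keep, dim=-1).indices                     # kept key blocks per query block
    # 2) full-resolution stage: attend only over the selected key blocks
    kb = k.view(b, h, nb, pool, d)
    vb = v.view(b, h, nb, pool, d)
    out = torch.empty_like(q)
    for i in range(nb):                                        # loop for clarity, not speed
        idx = top[:, :, i, :]                                  # [b, h, keep]
        gather_idx = idx[..., None, None].expand(b, h, keep, pool, d)
        k_sel = torch.gather(kb, 2, gather_idx).reshape(b, h, keep * pool, d)
        v_sel = torch.gather(vb, 2, gather_idx).reshape(b, h, keep * pool, d)
        q_blk = q[:, :, i * pool:(i + 1) * pool, :]
        attn = torch.softmax(q_blk @ k_sel.transpose(-1, -2) * d ** -0.5, dim=-1)
        out[:, :, i * pool:(i + 1) * pool, :] = attn @ v_sel
    return out
```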
arXiv Detail & Related papers (2025-05-17T04:34:34Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - Training-free and Adaptive Sparse Attention for Efficient Long Video Generation [31.615453637053793]
Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency. We propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs.
arXiv Detail & Related papers (2025-02-28T14:11:20Z) - Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing. These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators. We propose LeanAttention, a scalable technique for computing self-attention in the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment [9.883856205077022]
Video Quality Assessment (VQA) aims to predict the perceptual quality of a video.
VQA faces two underestimated challenges that remain unresolved in User Generated Content (UGC) videos.
We propose Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently.
arXiv Detail & Related papers (2023-07-31T16:29:29Z) - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z) - Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior arts, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, improving PSNR from 34.07 to 34.57.
arXiv Detail & Related papers (2022-10-13T08:15:08Z) - STIP: A SpatioTemporal Information-Preserving and Perception-Augmented Model for High-Resolution Video Prediction [78.129039340528]
We propose a SpatioTemporal Information-Preserving and Perception-Augmented Model (STIP) to solve the above two problems.
The proposed model aims to preserve the spatiotemporal information for videos during the feature extraction and the state transitions.
Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality compared with a variety of state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T09:49:04Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length, as sketched below.
This brings great benefits in scaling dimensions of depth, width, resolution, and patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
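A toy version of the token-pooling idea referenced in the HVT entry above: each stage runs a standard transformer encoder block and then halves the token sequence, so later stages attend over progressively fewer tokens. Layer sizes and the max-pooling choice are illustrative assumptions, not HVT's design.

```python
import torch
import torch.nn as nn

class PooledStage(nn.Module):
    """One toy transformer stage followed by 1D max-pooling over the token
    dimension, halving the sequence length."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x):                  # x: [batch, tokens, dim]
        x = self.block(x)
        return self.pool(x.transpose(1, 2)).transpose(1, 2)   # halve token count

# stacking stages shrinks the sequence (and the attention cost) progressively
tokens = torch.randn(2, 196, 256)          # e.g. 14x14 patch tokens
stages = nn.Sequential(PooledStage(), PooledStage(), PooledStage())
print(stages(tokens).shape)                # torch.Size([2, 24, 256])
```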
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.