Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
- URL: http://arxiv.org/abs/2505.22918v2
- Date: Fri, 30 May 2025 17:09:51 GMT
- Title: Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
- Authors: Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu
- Abstract summary: A huge bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. Existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. We propose Re-ttention, which implements very high sparse attention for visual generation models.
- Score: 23.01286982392074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiT) have become the de facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code available online here: https://github.com/cccrrrccc/Re-ttention
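The abstract describes renormalizing ultra-sparse attention with softmax statistics carried over from earlier denoising steps. As a rough illustration of that normalization-shift idea only (this is not the paper's actual algorithm; the function name, the top-k selection rule, and the denominator-reuse rule below are all assumptions), a minimal PyTorch sketch might look like:

```python
# Minimal sketch (not the authors' code): sparse attention whose softmax
# weights are renormalized with a denominator cached from an earlier
# denoising step. Names, keep_ratio, and the reuse rule are assumptions.
import torch

def sparse_attention_reshaped(q, k, v, keep_ratio=0.031, denom_cache=None):
    """q, k, v: (batch, heads, tokens, dim). Returns (output, denominator to cache)."""
    scale = q.shape[-1] ** -0.5
    # The full score matrix is materialized here only to keep the sketch short;
    # a real sparse kernel would avoid this.
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale

    # Ultra-sparse selection: keep only the top keep_ratio of key tokens per query.
    k_keep = max(1, int(keep_ratio * scores.shape[-1]))
    top_scores, top_idx = scores.topk(k_keep, dim=-1)

    # Partial softmax over the kept tokens (max-subtraction omitted for brevity).
    exp_top = top_scores.exp()
    partial_denom = exp_top.sum(dim=-1, keepdim=True)

    # Normalization shift: the true softmax denominator also includes the dropped
    # tokens. Reuse the full denominator observed at the previous denoising step
    # (temporal redundancy) instead of the partial one.
    denom = partial_denom if denom_cache is None else torch.maximum(denom_cache, partial_denom)
    weights = exp_top / denom

    # Gather the selected value vectors and mix them.
    v_exp = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1)        # (b, h, q, n, d)
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, -1, v.shape[-1])  # (b, h, q, k, d)
    out = (weights.unsqueeze(-1) * torch.gather(v_exp, 3, idx)).sum(dim=3)

    # Cache the full denominator for reuse at the next denoising step.
    return out, scores.exp().sum(dim=-1, keepdim=True)
```

Because consecutive denoising steps tend to produce similar attention distributions, reusing the previous step's denominator keeps the sparse weights on roughly the same scale as full attention, which is the property the abstract refers to as overcoming the probabilistic normalization shift.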
Related papers
- PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models [14.14413223631804]
In visual generation, the quadratic complexity of attention mechanisms results in high memory and computational costs. We propose an alternative strategy: reorganizing the attention pattern to alleviate the challenges. Inspired by the local aggregation nature of visual feature extraction, we design a novel Pattern-Aware token ReOrdering (PARO) technique.
arXiv Detail & Related papers (2025-06-19T06:25:02Z) - Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [70.4360995984905]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z) - Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation [1.3207844222875191]
Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. Static caching mitigates the repeated computation across denoising steps by reusing features at fixed steps, but fails to adapt to generation dynamics. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance.
arXiv Detail & Related papers (2025-05-31T00:52:17Z) - RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy [7.196471805257555]
RainFusion exploits the inherent sparsity of visual data to accelerate attention computation while preserving video quality. The proposed RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models.
arXiv Detail & Related papers (2025-05-27T11:15:02Z) - FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge [60.000984252907195]
Auto-regressive (AR) models have recently shown promise in visual generation tasks due to their superior sampling efficiency. Video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. We propose the FastCar framework to accelerate the decode phase of AR video generation by exploiting temporal redundancy.
arXiv Detail & Related papers (2025-05-17T05:00:39Z) - DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance [43.423240627266644]
Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency. We propose DraftAttention, a training-free framework for accelerating video diffusion transformers with dynamic sparse attention on GPUs (a hedged sketch of the underlying low-resolution guidance idea appears after this list).
arXiv Detail & Related papers (2025-05-17T04:34:34Z) - Training-free and Adaptive Sparse Attention for Efficient Long Video Generation [31.615453637053793]
Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency. We propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs.
arXiv Detail & Related papers (2025-02-28T14:11:20Z) - Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that these high-frequency components interfere with the coarse-to-fine nature of the diffusion synthesis process and hinder generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing. These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators. We propose LeanAttention, a scalable technique for computing self-attention in the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, improving PSNR from 34.07 to 34.57.
arXiv Detail & Related papers (2022-10-13T08:15:08Z) - STIP: A SpatioTemporal Information-Preserving and Perception-Augmented
Model for High-Resolution Video Prediction [78.129039340528]
We propose a SpatioTemporal Information-Preserving and Perception-Augmented Model (STIP) to solve the above two problems.
The proposed model aims to preserve the spatiotemporal information of videos during feature extraction and state transitions.
Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality compared with a variety of state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T09:49:04Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit: depth, width, resolution, and patch size can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
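Several of the sparse-attention entries above, most explicitly DraftAttention, select which parts of the attention map to compute from a coarse, low-resolution view of the scores. The sketch below illustrates that general block-selection idea only; the block size, the number of kept blocks, the head-shared selection, and all names are assumptions rather than any paper's actual implementation.

```python
# Hedged sketch of low-resolution ("draft") guidance for block-sparse attention.
import torch

def draft_guided_block_attention(q, k, v, block=16, keep_blocks=4):
    """q, k, v: (batch, heads, tokens, dim); tokens must be divisible by block."""
    b, h, n, d = q.shape
    nb = n // block
    scale = d ** -0.5

    # 1. Low-resolution draft scores from mean-pooled query/key blocks.
    q_low = q.reshape(b, h, nb, block, d).mean(dim=3)
    k_low = k.reshape(b, h, nb, block, d).mean(dim=3)
    draft = torch.einsum("bhqd,bhkd->bhqk", q_low, k_low).mean(dim=(0, 1))  # (nb, nb)

    # 2. Each query block keeps only its top-scoring key blocks
    #    (selection shared across batch and heads purely to keep the sketch short).
    top_idx = draft.topk(keep_blocks, dim=-1).indices                       # (nb, keep_blocks)

    # 3. Exact attention restricted to the selected key blocks.
    out = torch.empty_like(q)
    offsets = torch.arange(block, device=q.device)
    for qb in range(nb):
        cols = (top_idx[qb][:, None] * block + offsets).reshape(-1)         # kept key tokens
        q_blk = q[:, :, qb * block:(qb + 1) * block]                        # (b, h, block, d)
        k_blk, v_blk = k[:, :, cols], v[:, :, cols]                         # (b, h, keep*block, d)
        attn = torch.softmax(torch.einsum("bhqd,bhkd->bhqk", q_blk, k_blk) * scale, dim=-1)
        out[:, :, qb * block:(qb + 1) * block] = torch.einsum("bhqk,bhkd->bhqd", attn, v_blk)
    return out
```

The draft map only requires (tokens/block)^2 coarse score evaluations, so deciding which blocks to keep is cheap relative to the exact attention it avoids.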