TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
- URL: http://arxiv.org/abs/2511.12035v1
- Date: Sat, 15 Nov 2025 05:07:31 GMT
- Title: TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
- Authors: Wenxuan Miao, Yulin Sun, Aiyue Chen, Jing Lin, Yiwu Yao, Yiming Gan, Jieru Zhao, Jingwen Leng, Minyi Guo, Yu Feng
- Abstract summary: We take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT are primarily due to the dominant spatial and temporal correlations at the token channel level. We propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels.
- Score: 15.535854202219072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent surge in video generation has shown the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT); however, they suffer from substantial inference delay due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly leverage sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT are primarily due to the dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85%) compared to state-of-the-art techniques over 4 vDiTs, while preserving almost identical video quality (<0.06% loss on VBench).
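The abstract only names the reuse mechanism, so the following is a minimal, hypothetical sketch of one way to read it: attention scores decompose per channel as QK^T = sum over channels c of Q[:, c] K[:, c]^T, so a per-channel partial score computed for one token can be reused for a spatially or temporally correlated token whose query barely differs along that channel. The function name, the `corr_src` correlation map, and the closeness `threshold` are all assumptions for illustration; the paper's actual adaptive criterion is not specified in this digest.

```python
import torch

def reuse_attention_scores(Q, K, corr_src, threshold=0.02):
    """Hypothetical sketch of channel-wise partial-score reuse.

    Q, K:      (T, C) query/key matrices for one attention head.
    corr_src:  length-T list; corr_src[i] is the index of a spatially or
               temporally correlated token for token i (e.g., the same
               spatial position in the previous frame), or -1 for none.
    threshold: max per-channel query difference tolerated for reuse.

    Returns an approximation of Q @ K.T in which, along channels where a
    token's query is close to its correlated source's query, the source's
    partial score is reused instead of being recomputed.
    """
    T, C = Q.shape
    # Exact per-channel partial scores: partial[i, c, j] = Q[i, c] * K[j, c].
    # Computed densely here for clarity; a real kernel would skip the
    # channels that end up being reused.
    partial = Q.unsqueeze(2) * K.t().unsqueeze(0)  # (T, C, T)

    approx = partial.clone()
    for i in range(T):
        src = corr_src[i]
        if src < 0:
            continue
        # Channels where token i's query barely differs from its source's.
        reusable = (Q[i] - Q[src]).abs() < threshold  # (C,)
        # Reuse the source token's partial scores along those channels.
        approx[i, reusable, :] = partial[src, reusable, :]

    return approx.sum(dim=1)  # (T, T) approximate attention scores

# Toy usage with hypothetical shapes: 2 frames x 4 tokens, 8 channels.
torch.manual_seed(0)
Q, K = torch.randn(8, 8), torch.randn(8, 8)
# Assume each token in frame 1 is temporally correlated with the token
# at the same spatial position in frame 0.
corr_src = [-1, -1, -1, -1, 0, 1, 2, 3]
scores = reuse_attention_scores(Q, K, corr_src)
exact = Q @ K.t()
print((scores - exact).abs().max())  # approximation error introduced by reuse
```

In a real kernel the reusable channels would be skipped outright rather than computed and then overwritten; the dense formulation above exists only to make the approximation explicit.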
Related papers
- Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation [50.05866669110754]
Video-based gaze estimation methods aim to capture the inherent temporal dynamics of human eye gaze from multiple image frames.
We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules.
We show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation.
arXiv Detail & Related papers (2025-12-19T15:15:58Z) - USV: Unified Sparsification for Accelerating Video Diffusion Models [11.011602744993942]
Unified Sparsification for Video diffusion models is an end-to-end trainable framework.
It orchestrates sparsification across both the model's internal computation and its sampling process.
It achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity.
arXiv Detail & Related papers (2025-12-05T14:40:06Z) - Sliding Window Attention for Learned Video Compression [67.57073402826292]
This work introduces 3D Sliding Window Attention (SWA), a patchless form of local attention.
Our method significantly improves rate-distortion performance, achieving Bjøntegaard Delta-rate savings of up to 18.6%.
arXiv Detail & Related papers (2025-10-04T20:11:43Z) - Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation [50.04968365065964]
Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference.
We introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP).
We also propose Decoupled Foreground Attention (DFA) to further accelerate attention computations.
arXiv Detail & Related papers (2025-08-25T02:58:39Z) - Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation [21.87891961960399]
Compact Attention is a hardware-aware acceleration framework featuring three innovations.
Our method achieves 1.6-2.5x acceleration in attention on single-GPU setups.
This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation.
arXiv Detail & Related papers (2025-08-18T14:45:42Z) - Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape [38.76559841681518]
A huge bottleneck is the attention mechanism, where complexity scales quadratically with resolution and video length.
Existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads.
We propose Re-ttention, which implements very high-sparsity attention for visual generation models.
arXiv Detail & Related papers (2025-05-28T22:39:12Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation (a minimal mask sketch of this uni-directional constraint appears after this list).
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z) - Exploiting Spatial-temporal Correlations for Video Anomaly Detection [7.336831373786849]
Video anomaly detection (VAD) remains a challenging task in the pattern recognition community due to the ambiguity and diversity of abnormal events.
We introduce a discriminator to perform adversarial learning with the ST-LSTM to enhance the learning capability.
Our method achieves competitive performance compared to the state-of-the-art methods, with AUCs of 96.7%, 87.8%, and 73.1% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.
arXiv Detail & Related papers (2022-11-02T02:13:24Z) - Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z)
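As a companion illustration, the Live2Diff entry above names uni-directional temporal attention as its central mechanism. The sketch below is a generic, hypothetical construction of such a mask under an assumed frame-major token layout: tokens attend within their own frame and to earlier frames, never to future ones. It does not reproduce Live2Diff's actual architecture, which this digest does not detail.

```python
import torch

def unidirectional_temporal_mask(num_frames, tokens_per_frame):
    """Build a uni-directional (causal-in-time) attention mask.

    Returns a boolean mask of shape (N, N), with N = num_frames *
    tokens_per_frame, where True marks an allowed query-key pair:
    a query token may attend to keys in its own frame or earlier frames.
    """
    # Frame index of every token in the flattened (frame-major) sequence.
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # Query i may attend to key j iff j's frame is not later than i's.
    return frame_idx.unsqueeze(1) >= frame_idx.unsqueeze(0)  # (N, N)

# Usage: turn the boolean mask into an additive attention bias.
mask = unidirectional_temporal_mask(num_frames=4, tokens_per_frame=16)
bias = torch.zeros(mask.shape).masked_fill(~mask, float("-inf"))
```

The resulting bias can be added to the attention logits before the softmax, for example via the attn_mask argument of torch.nn.functional.scaled_dot_product_attention.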