AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
- URL: http://arxiv.org/abs/2412.11706v2
- Date: Sun, 09 Mar 2025 16:14:51 GMT
- Title: AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
- Authors: Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, Dacheng Tao
- Abstract summary: Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs.
- Score: 45.62669899834342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. Existing video DiT sampling acceleration methods often rely on costly fine-tuning or exhibit limited generalization capabilities. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs. It builds on the observation that redundancies of feature tokens in DiTs vary significantly across different model blocks, denoising steps, and feature types. Our AsymRnR asymmetrically reduces redundant tokens in the attention operation, achieving acceleration with negligible degradation in output quality and, in some cases, even improving it. We also tailor a reduction schedule to distribute the reduction adaptively across components. To further accelerate this process, we introduce a matching cache for more efficient reduction. Backed by theoretical foundations and extensive experimental validation, AsymRnR integrates into state-of-the-art video DiTs and offers substantial speedups.
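To make the reduce-then-restore idea concrete, below is a minimal sketch, not the paper's implementation: tokens whose nearest neighbour is most similar are dropped before attention, queries and keys/values are reduced by different amounts (the asymmetry), and removed queries copy their output from the survivor they were matched to (the restoration). The matching rule and the names `matching`, `attention_rnr`, `r_q`, and `r_kv` are illustrative assumptions; the paper's schedule and matching cache are not modeled here.

```python
import torch
import torch.nn.functional as F

def matching(x: torch.Tensor, r: int):
    """x: (N, C). Pick the r most redundant tokens (highest similarity
    to their nearest neighbour) for removal. Returns the indices of the
    surviving tokens and, for every original position, the index of the
    survivor that stands in for it."""
    N = x.shape[0]
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.T
    sim.fill_diagonal_(-1.0)                     # a token never matches itself
    redundancy = sim.max(dim=-1).values
    drop = redundancy.topk(r).indices            # most redundant tokens go
    keep = torch.ones(N, dtype=torch.bool, device=x.device)
    keep[drop] = False
    kept_idx = keep.nonzero(as_tuple=True)[0]
    # Each dropped token is matched to its most similar surviving token.
    match = sim[drop][:, kept_idx].argmax(dim=-1)
    restore = torch.empty(N, dtype=torch.long, device=x.device)
    restore[kept_idx] = torch.arange(kept_idx.numel(), device=x.device)
    restore[drop] = match
    return kept_idx, restore

def attention_rnr(q, k, v, r_q: int, r_kv: int):
    """Asymmetric reduction: queries and keys/values are shrunk by
    different amounts; the output is then restored to full length by
    copying each removed query's output from its matched survivor."""
    kept_q, restore_q = matching(q, r_q)
    kept_kv, _ = matching(k, r_kv)               # one matching shared by k and v
    out = F.scaled_dot_product_attention(
        q[kept_q][None], k[kept_kv][None], v[kept_kv][None])[0]
    return out[restore_q]                        # restoration: (N, C) again

q, k, v = (torch.randn(256, 64) for _ in range(3))
out = attention_rnr(q, k, v, r_q=64, r_kv=128)   # attention over fewer tokens
```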
Related papers
- H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [76.1519545010611]
The autoencoder (AE) is key to the success of latent diffusion models for image and video generation.
In this work, we examine the architecture design choices and optimize the computation distribution to obtain efficient and high-compression video AEs.
Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics.
arXiv Detail & Related papers (2025-04-14T17:59:06Z)
- DDT: Decoupled Diffusion Transformer [51.84206763079382]
Diffusion transformers encode noisy inputs to extract the semantic component and decode higher-frequency detail with identical modules.
The paper proposes the Decoupled Diffusion Transformer (DDT), which decouples these two roles into separate modules.
arXiv Detail & Related papers (2025-04-08T07:17:45Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model.
It is trained from scratch using only a basic MSE diffusion loss for reconstruction, together with a KL term and an LPIPS perceptual loss.
Even a scaled-down version of CDT (with a 3× inference speedup) still performs comparably to top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
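A hedged sketch of the composite objective this summary describes follows; the loss weights, the diagonal-Gaussian KL, and the `perceptual_fn` hook (standing in for a pretrained LPIPS network) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(pred, target, posterior_mean, posterior_logvar,
                   perceptual_fn, kl_w: float = 1e-6, lpips_w: float = 0.1):
    """Composite objective in the spirit of the CDT summary: an MSE
    reconstruction/diffusion loss plus a KL term on the latent posterior
    and a perceptual (LPIPS-style) loss. Weights are illustrative."""
    rec = F.mse_loss(pred, target)
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = 0.5 * (posterior_mean.pow(2) + posterior_logvar.exp()
                - 1.0 - posterior_logvar).mean()
    per = perceptual_fn(pred, target)        # e.g. a pretrained LPIPS net
    return rec + kl_w * kl + lpips_w * per

# Toy call with an L1 stand-in for the perceptual term.
loss = tokenizer_loss(torch.randn(2, 3, 8, 8), torch.randn(2, 3, 8, 8),
                      torch.randn(2, 4), torch.randn(2, 4),
                      perceptual_fn=F.l1_loss)
```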
- Accelerating Vision Diffusion Transformers with Skip Branches [47.07564477125228]
Diffusion Transformers (DiT) are an emerging image and video generation model architecture.
DiT's practical deployment is constrained by computational complexity and redundancy in the sequential denoising process.
We introduce Skip-DiT, which converts a standard DiT into one with skip branches to enhance feature smoothness.
We also introduce Skip-Cache, which utilizes the skip branches to cache DiT features across timesteps at inference time.
arXiv Detail & Related papers (2024-11-26T17:28:10Z)
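The toy module below illustrates the caching pattern this summary suggests: deep features are recomputed only on some steps and replayed otherwise, while a skip branch keeps the current step's shallow features in the mix. The class, the fusion layer, and the alternating schedule are assumptions for illustration, not Skip-DiT's actual architecture.

```python
import torch
import torch.nn as nn

class SkipBlockStack(nn.Module):
    """Toy stack of blocks with one long skip branch. On cached steps the
    expensive middle blocks are skipped and their last output is reused;
    the skip branch still injects the current step's shallow features."""
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(depth))
        self.fuse = nn.Linear(2 * dim, dim)   # merges skip + deep features
        self.cache = None

    def forward(self, x: torch.Tensor, reuse_cache: bool) -> torch.Tensor:
        shallow = self.blocks[0](x)
        if reuse_cache and self.cache is not None:
            deep = self.cache                 # skip the middle blocks
        else:
            deep = shallow
            for blk in self.blocks[1:]:
                deep = deep + blk(deep)
            self.cache = deep
        return self.fuse(torch.cat([shallow, deep], dim=-1))

model = SkipBlockStack(dim=64, depth=6)
x = torch.randn(1, 16, 64)
for t in range(10):                           # toy denoising loop
    x = model(x, reuse_cache=(t % 2 == 1))    # recompute every other step
```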
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache).
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
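A minimal sketch of motion-aware cache scheduling in this spirit: the frame-difference motion score and the linear mapping to a cache lifetime are illustrative stand-ins for MoReg and AdaCache's actual rules.

```python
import torch

def motion_score(latents: torch.Tensor) -> float:
    """Mean absolute difference between consecutive frame latents, a
    crude stand-in for the motion content MoReg is said to measure.
    latents: (T, C, H, W)."""
    return (latents[1:] - latents[:-1]).abs().mean().item()

def cache_lifetime(latents: torch.Tensor, min_life: int = 1,
                   max_life: int = 6, scale: float = 10.0) -> int:
    """Adaptive schedule in the spirit of AdaCache: static clips get a
    long cache lifetime (few recomputations), high-motion clips a short
    one. The mapping is illustrative, not the paper's rule."""
    life = int(max_life - scale * motion_score(latents))
    return max(min_life, min(max_life, life))

latents = torch.randn(8, 4, 32, 32) * 0.1        # toy video latents
steps_between_recompute = cache_lifetime(latents)
```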
- ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack [71.2286719703198]
We propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA).
To achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy.
To achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames.
arXiv Detail & Related papers (2024-08-10T08:10:30Z)
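A toy version of cross-frame token merging, assuming cosine-similarity matching and pairwise averaging between two frames; ReToMe's recursive, multi-frame procedure is more involved than this sketch.

```python
import torch
import torch.nn.functional as F

def merge_across_frames(a: torch.Tensor, b: torch.Tensor, r: int):
    """Match each token of frame `b` to its most similar token in frame
    `a` and fold the r best-matched b-tokens into their targets by
    averaging, shrinking the two-frame token set from 2N to 2N - r.
    a, b: (N, C)."""
    an, bn = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    sim = bn @ an.T                           # (N, N) cross-frame similarity
    best, target = sim.max(dim=-1)            # best a-match per b-token
    merged_b = best.topk(r).indices           # b-tokens to merge away
    keep = torch.ones(b.shape[0], dtype=torch.bool)
    keep[merged_b] = False
    a = a.clone()
    # Duplicate targets keep the last write; acceptable for a sketch.
    a[target[merged_b]] = 0.5 * (a[target[merged_b]] + b[merged_b])
    return torch.cat([a, b[keep]], dim=0)     # (2N - r, C)

tokens = merge_across_frames(torch.randn(128, 64), torch.randn(128, 64), r=32)
```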
- FORA: Fast-Forward Caching in Diffusion Transformer Acceleration [39.51519525071639]
Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos.
Fast-FORward CAching (FORA) is designed to accelerate DiT by exploiting the repetitive nature of the diffusion process.
arXiv Detail & Related papers (2024-07-01T16:14:37Z)
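A minimal sketch of interval-based feature caching consistent with this description; the wrapper class and the fixed `interval` are illustrative, not FORA's actual mechanism.

```python
import torch
import torch.nn as nn

class CachedLayer(nn.Module):
    """Recompute the wrapped layer only every `interval` denoising steps
    and replay the cached output in between, exploiting the repetitive
    nature of the diffusion process that the FORA summary points to."""
    def __init__(self, layer: nn.Module, interval: int = 3):
        super().__init__()
        self.layer, self.interval = layer, interval
        self.step, self.cache = 0, None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.cache is None or self.step % self.interval == 0:
            self.cache = self.layer(x)
        self.step += 1
        return self.cache

layer = CachedLayer(nn.Linear(64, 64), interval=3)
x = torch.randn(1, 16, 64)
outs = [layer(x) for _ in range(6)]   # layer only evaluated at steps 0 and 3
```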
- ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation [23.99995355561429]
Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity.
We introduce ViDiT-Q (Video & Image Diffusion Transformer Quantization), tailored specifically for DiT models.
We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models, achieving W8A8 and W4A8 with negligible degradation in visual quality and metrics.
arXiv Detail & Related papers (2024-06-04T17:57:10Z)
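For orientation, a bare-bones simulation of a W8A8 linear layer via symmetric per-tensor fake quantization; ViDiT-Q's calibration and mixed-precision scheme are substantially more sophisticated, so treat this only as a sketch of the basic operation.

```python
import torch

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake quantization: round onto a 2^bits-level
    grid and dequantize, the basic operation behind W8A8/W4A8 PTQ."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def w8a8_linear(x, weight, bias=None):
    """Linear layer with simulated 8-bit weights and 8-bit activations."""
    return torch.nn.functional.linear(fake_quant(x, 8),
                                      fake_quant(weight, 8), bias)

y = w8a8_linear(torch.randn(2, 64), torch.randn(32, 64))
```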
- Binarized Low-light Raw Video Enhancement [49.65466843856074]
Deep neural networks have achieved excellent performance on low-light raw video enhancement.
In this paper, we explore the feasibility of applying the extremely compact binary neural network (BNN) to low-light raw video enhancement.
arXiv Detail & Related papers (2024-03-29T02:55:07Z)
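A minimal sketch of the standard BNN building block, sign binarization trained through a straight-through estimator; the paper's binarized architecture for raw video is of course more elaborate than this generic primitive.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator, the usual
    building block of binary neural networks (BNNs)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients straight through, clipped outside [-1, 1].
        return grad_out * (x.abs() <= 1).float()

x = torch.randn(4, requires_grad=True)
BinarizeSTE.apply(x).sum().backward()   # gradients flow through the STE
```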
- HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition [11.362605513514943]
Action recognition in videos poses a challenge due to its high computational cost.
We propose HaltingVT, an efficient video transformer that adaptively removes redundant video patch tokens.
On the Mini-Kinetics dataset, it achieves 75.0% top-1 accuracy at 24.2 GFLOPs, and 67.2% top-1 accuracy at only 9.9 GFLOPs.
arXiv Detail & Related papers (2024-01-10T07:42:55Z)
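A toy illustration of token halting, assuming halting scores are already available for each patch token; HaltingVT learns these scores end to end, and its halting policy is more refined than a fixed threshold.

```python
import torch

def halt_tokens(tokens: torch.Tensor, halting_scores: torch.Tensor,
                threshold: float = 0.5) -> torch.Tensor:
    """Drop video patch tokens whose halting score exceeds a threshold,
    so subsequent transformer blocks process fewer tokens.
    tokens: (N, C), halting_scores: (N,) in [0, 1]."""
    keep = halting_scores < threshold
    return tokens[keep]

tokens = torch.randn(196, 64)
scores = torch.rand(196)              # stand-in for a learned halting head
tokens = halt_tokens(tokens, scores)  # later blocks see fewer tokens
```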
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from only a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)