AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
- URL: http://arxiv.org/abs/2412.11706v1
- Date: Mon, 16 Dec 2024 12:28:22 GMT
- Title: AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
- Authors: Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, Dacheng Tao
- Abstract summary: Video Diffusion Transformers (DiTs) have demonstrated significant potential for generating high-fidelity videos but are computationally intensive.
We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free approach to accelerate video DiTs.
- Score: 45.62669899834342
- Abstract: Video Diffusion Transformers (DiTs) have demonstrated significant potential for generating high-fidelity videos but are computationally intensive. Existing acceleration methods include distillation, which requires costly retraining, and feature caching, which is highly sensitive to network architecture. Recent token reduction methods are training-free and architecture-agnostic, offering greater flexibility and wider applicability. However, they enforce the same sequence length across different components, constraining their acceleration potential. We observe that intra-sequence redundancy in video DiTs varies across features, blocks, and denoising timesteps. Building on this observation, we propose Asymmetric Reduction and Restoration (AsymRnR), a training-free approach to accelerate video DiTs. It offers a flexible and adaptive strategy that reduces the number of tokens based on their redundancy to enhance both acceleration and generation quality. We further propose a matching cache to facilitate faster processing. Integrated into state-of-the-art video DiTs, AsymRnR achieves a superior speedup without compromising quality.
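As a rough illustration of the reduction-and-restoration idea, the sketch below drops the most redundant tokens before an expensive block and restores them afterwards. It is a minimal PyTorch sketch assuming a ToMe-style bipartite nearest-neighbor matching; the function names, the drop-based reduction, and the fixed count r are illustrative assumptions, not the paper's exact algorithm (which sets the reduction asymmetrically across features, blocks, and timesteps). The matching tuple returned by reduce_tokens is also the kind of state one could reuse across nearby timesteps, loosely in the spirit of the paper's matching cache.

```python
import torch
import torch.nn.functional as F

def reduce_tokens(x: torch.Tensor, r: int):
    """Drop the r most redundant tokens before an expensive block.

    x: (batch, n, dim) with n even. Tokens at even positions ("src") are
    matched to their most similar token at odd positions ("dst"); the r
    src tokens with the highest similarity are treated as redundant and
    removed. Returns the shortened sequence plus the matching info needed
    for restoration (a candidate for caching across nearby timesteps).
    """
    b, n, d = x.shape
    src, dst = x[:, 0::2], x[:, 1::2]
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).transpose(-1, -2)
    score, match = sim.max(dim=-1)                 # best dst partner per src token
    order = score.argsort(dim=-1, descending=True)
    drop, keep = order[:, :r], order[:, r:]        # most redundant src tokens go
    kept_src = src.gather(1, keep.unsqueeze(-1).expand(-1, -1, d))
    return torch.cat([kept_src, dst], dim=1), (keep, drop, match)

def restore_tokens(y: torch.Tensor, info):
    """Restore full length after the block ran on the reduced sequence.

    Each dropped src token takes the output of the dst token it matched.
    """
    keep, drop, match = info
    b, _, d = y.shape
    n_src = keep.shape[1] + drop.shape[1]
    kept_out, dst_out = y[:, : keep.shape[1]], y[:, keep.shape[1] :]
    src_out = torch.zeros(b, n_src, d, dtype=y.dtype, device=y.device)
    src_out.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, d), kept_out)
    drop_match = match.gather(1, drop)             # matched dst index per dropped token
    src_out.scatter_(
        1,
        drop.unsqueeze(-1).expand(-1, -1, d),
        dst_out.gather(1, drop_match.unsqueeze(-1).expand(-1, -1, d)),
    )
    out = torch.empty(b, 2 * n_src, d, dtype=y.dtype, device=y.device)
    out[:, 0::2], out[:, 1::2] = src_out, dst_out  # re-interleave src/dst
    return out

# Usage: run a block at reduced length, then restore the full sequence.
x = torch.randn(2, 16, 64)
y, info = reduce_tokens(x, r=4)    # an attention/MLP block would now see 12 tokens
x_full = restore_tokens(y, info)   # back to 16 tokens
assert x_full.shape == x.shape
```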
Related papers
- Accelerating Vision Diffusion Transformers with Skip Branches [47.07564477125228]
Diffusion Transformers (DiTs) are an emerging architecture for image and video generation.
Their practical deployment is constrained by the computational complexity and redundancy of the sequential denoising process.
We introduce Skip-DiT, which augments a standard DiT with skip branches to enhance feature smoothness.
We also introduce Skip-Cache which utilizes the skip branches to cache DiT features across timesteps at the inference time.
arXiv Detail & Related papers (2024-11-26T17:28:10Z)
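A minimal sketch of the skip-branch caching idea above, assuming a DiT split into shallow and deep stages: the deep stage's output is cached and reused for a few timesteps while only the shallow stage is recomputed. The class name, the fixed refresh_every schedule, and the additive skip join are illustrative assumptions, not Skip-DiT's exact design.

```python
import torch
import torch.nn as nn

class SkipCachedDiT(nn.Module):
    """Illustrative skeleton: cache the deep stage's output and replay it
    across timesteps, recomputing it only every `refresh_every` steps.
    Assumes both stages preserve the feature shape."""

    def __init__(self, shallow: nn.Module, deep: nn.Module, refresh_every: int = 2):
        super().__init__()
        self.shallow, self.deep = shallow, deep
        self.refresh_every = refresh_every
        self._cache = None

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        h = self.shallow(x)
        if self._cache is None or step % self.refresh_every == 0:
            self._cache = self.deep(h)    # full forward: refresh the cache
        return h + self._cache            # skip branch joins shallow and (cached) deep features

net = SkipCachedDiT(nn.Linear(64, 64), nn.Linear(64, 64))
for step in range(4):
    out = net(torch.randn(2, 64), step)  # deep stage runs only at steps 0 and 2
```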
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache).
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
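The adaptive part can be sketched as a per-block decision: replay a cached residual while the block's input drifts little between consecutive steps, and recompute otherwise. The relative-change metric, the threshold, and the residual-caching convention below are illustrative assumptions; AdaCache's actual schedule (and the MoReg motion term that modulates it) is more involved.

```python
import torch
import torch.nn as nn

class AdaptiveBlockCache:
    """Illustrative content-adaptive cache: recompute the wrapped block only
    when its input has changed enough since the last computed step; otherwise
    replay the cached residual. Assumes the block preserves the input shape."""

    def __init__(self, block: nn.Module, threshold: float = 0.05):
        self.block, self.threshold = block, threshold
        self.prev_in, self.cached_res = None, None

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        if self.prev_in is not None:
            drift = (x - self.prev_in).norm() / self.prev_in.norm()
            if drift < self.threshold:
                return x + self.cached_res    # cheap path: reuse cached residual
        self.cached_res = self.block(x) - x   # expensive path: recompute residual
        self.prev_in = x.detach()
        return x + self.cached_res

cached = AdaptiveBlockCache(nn.Linear(64, 64))
x = torch.randn(2, 64)
y1 = cached(x)            # first call: block is computed
y2 = cached(x + 0.001)    # tiny drift: cached residual is replayed
```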
- SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation.
SparseTem achieves speedup of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z)
- ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack [71.2286719703198]
We propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA).
To achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy.
To achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames.
arXiv Detail & Related papers (2024-08-10T08:10:30Z)
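The cross-frame merging step above might look like the following: each token in the current frame is matched to its most similar token in the previous frame, and near-duplicates are dropped and represented by their match. The cosine threshold tau is an illustrative assumption; ReToMe-VA applies such matching recursively across frames inside the diffusion process.

```python
import torch
import torch.nn.functional as F

def merge_across_frames(prev: torch.Tensor, cur: torch.Tensor, tau: float = 0.9):
    """Illustrative cross-frame token merging for one frame pair.

    prev, cur: (n_tokens, dim). Tokens in `cur` whose best cosine match in
    `prev` exceeds `tau` are dropped and represented by that match."""
    sim = F.normalize(cur, dim=-1) @ F.normalize(prev, dim=-1).transpose(-1, -2)
    score, match = sim.max(dim=-1)    # best previous-frame partner per token
    keep = score <= tau               # only sufficiently novel tokens survive
    return cur[keep], match[~keep]    # surviving tokens + merge targets of dropped ones

f0, f1 = torch.randn(196, 64), torch.randn(196, 64)
kept, merged_into = merge_across_frames(f0, f1)
```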
- Binarized Low-light Raw Video Enhancement [49.65466843856074]
Deep neural networks have achieved excellent performance on low-light raw video enhancement.
In this paper, we explore the feasibility of applying the extremely compact binary neural network (BNN) to low-light raw video enhancement.
arXiv Detail & Related papers (2024-03-29T02:55:07Z)
- Boosting Neural Representations for Videos with a Conditional Decoder [28.073607937396552]
Implicit neural representations (INRs) have emerged as a promising approach for video storage and processing.
This paper introduces a universal boosting framework for current implicit video representation approaches.
arXiv Detail & Related papers (2024-02-28T08:32:19Z)
- HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition [11.362605513514943]
Action recognition in videos poses a challenge due to its high computational cost.
We propose HaltingVT, an efficient video transformer that adaptively removes redundant video patch tokens.
On the Mini-Kinetics dataset, we achieved 75.0% top-1 ACC with 24.2 GFLOPs, as well as 67.2% top-1 ACC with an extremely low 9.9 GFLOPs.
arXiv Detail & Related papers (2024-01-10T07:42:55Z)
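The token-halting idea above can be sketched as a per-layer gate: a tiny scoring head rates each token, and tokens scoring below a threshold are retired before the next layer, so compute shrinks as the network deepens. The linear scorer and the fixed threshold are illustrative assumptions, not HaltingVT's exact halting rule.

```python
import torch
import torch.nn as nn

class TokenHaltingGate(nn.Module):
    """Illustrative token-halting gate: score tokens and drop the ones the
    scorer deems uninformative before the next transformer layer."""

    def __init__(self, dim: int, threshold: float = 0.1):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.threshold = threshold

    def forward(self, x: torch.Tensor):
        # x: (n_tokens, dim)
        score = torch.sigmoid(self.scorer(x)).squeeze(-1)   # (n_tokens,)
        keep = score >= self.threshold
        return x[keep], keep    # remaining tokens and the keep mask

gate = TokenHaltingGate(dim=64)
tokens, kept = gate(torch.randn(196, 64))   # downstream layers see fewer tokens
```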
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from only a few adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)