Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models
- URL: http://arxiv.org/abs/2411.18375v1
- Date: Wed, 27 Nov 2024 14:22:13 GMT
- Title: Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models
- Authors: Yiming Wu, Huan Wang, Zhenghao Chen, Dong Xu
- Abstract summary: High computational cost and slow inference time are major obstacles to deploying the video diffusion model (VDM) in practical applications.
We introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss.
- Score: 26.556159722909715
- Abstract: High computational cost and slow inference are major obstacles to deploying video diffusion models (VDMs) in practical applications. To overcome this, we introduce a new video diffusion model compression approach that combines individual content and motion dynamics preserved pruning with a consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of motion dynamics (e.g., coherence of the entire video), while shallower layers focus more on individual content (e.g., individual frames). Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Additionally, we propose an Individual Content and Motion Dynamics (ICMD) Consistency Loss so that VDMini (the student) attains generation performance comparable to that of the larger VDM (the teacher). Specifically, we first use the Individual Content Distillation (ICD) Loss to ensure consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we achieve average speedups of 2.5× and 1.4× for the I2V method SF-V and the T2V method T2V-Turbo-v2, respectively, while maintaining the quality of the generated videos on two benchmarks, UCF101 and VBench.
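The two ideas in the abstract that lend themselves to a small illustration are the depth-aware pruning (drop more blocks from shallow layers than from deep ones) and the per-frame feature distillation. The sketch below is only a toy under stated assumptions: the half-way split between "shallow" and "deep", the `shallow_keep_every` parameter, the function names, and the use of mean squared error for the per-frame term are all hypothetical choices made here, since the abstract does not specify the pruning criterion or the exact form of the ICD Loss. The adversarial MCA Loss is not covered.

```python
from typing import List, Sequence


def select_blocks(num_blocks: int, shallow_keep_every: int = 2) -> List[int]:
    """Return indices of blocks to keep.

    Assumption (not from the paper): the first half of the blocks counts as
    "shallow" and is thinned to every `shallow_keep_every`-th block, while
    the second, "deep" half is kept in full.
    """
    half = num_blocks // 2
    kept = [i for i in range(half) if i % shallow_keep_every == 0]
    kept += list(range(half, num_blocks))
    return kept


def per_frame_distill_loss(
    teacher: Sequence[Sequence[float]],
    student: Sequence[Sequence[float]],
) -> float:
    """Mean over frames of the per-frame MSE between teacher and student features.

    Each argument is a list of frames, and each frame is a flat list of
    feature values. MSE is an illustrative stand-in for the ICD Loss.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of frames")
    frame_losses = []
    for t_frame, s_frame in zip(teacher, student):
        squared = [(t - s) ** 2 for t, s in zip(t_frame, s_frame)]
        frame_losses.append(sum(squared) / len(squared))
    return sum(frame_losses) / len(frame_losses)


# Toy usage: 8 blocks, and two frames with two features each.
kept = select_blocks(8)
loss = per_frame_distill_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]])
```

In this toy run, two of the four shallow blocks are dropped and all four deep blocks survive, which mirrors the asymmetry described in the abstract without claiming to reproduce the authors' selection rule.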
Related papers
- CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation [75.10635392993748]
We introduce CatV2TON, a vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks.
By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance.
We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing.
arXiv Detail & Related papers (2025-01-20T08:09:36Z)
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models [89.79067761383855]
Vchitect-2.0 is a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation.
By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames.
To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework.
arXiv Detail & Related papers (2025-01-14T21:53:11Z)
- VidTwin: Video VAE with Decoupled Structure and Dynamics [24.51768013474122]
VidTwin is a video autoencoder that decouples video into two distinct latent spaces.
Structure latent vectors capture overall content and global movement, and Dynamics latent vectors represent fine-grained details and rapid movements.
Experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality.
arXiv Detail & Related papers (2024-12-23T17:16:58Z)
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model [15.171544722138806]
Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space.
VAE is a key component of most Latent Video Diffusion Models (LVDMs).
arXiv Detail & Related papers (2024-11-26T14:23:53Z)
- ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models [66.84478240757038]
A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip.
We introduce causal (i.e., unidirectional) generation into VDMs, and use past frames as a prompt to generate future frames.
Our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation.
arXiv Detail & Related papers (2024-06-16T15:37:22Z)
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation [37.05422543076405]
Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence.
Existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame.
We propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation.
arXiv Detail & Related papers (2024-02-06T19:08:18Z)
- Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs [112.39389727164594]
Text-to-video (T2V) synthesis has gained increasing attention in the community, and recently emerged diffusion models (DMs) have shown stronger performance than past approaches.
While existing state-of-the-art DMs are capable of high-resolution video generation, they may suffer from key limitations (e.g., action occurrence disorders, crude video motions) in temporal dynamics modeling, which is a crux of video synthesis.
In this work, we investigate strengthening awareness of video dynamics for DMs, for high-quality T2V generation.
arXiv Detail & Related papers (2023-08-26T08:31:48Z)