OSV: One Step is Enough for High-Quality Image to Video Generation
- URL: http://arxiv.org/abs/2409.11367v1
- Date: Tue, 17 Sep 2024 17:16:37 GMT
- Title: OSV: One Step is Enough for High-Quality Image to Video Generation
- Authors: Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang
- Abstract summary: We introduce a two-stage training framework that effectively combines consistency distillation and GAN training.
We also propose a novel video discriminator design, which eliminates the need for decoding the video latents.
Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement.
- Score: 29.77646091911169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. While efforts have been made to accelerate video diffusion by reducing inference steps, through techniques such as consistency distillation and GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79), and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).
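The abstract describes the approach only at a high level: a first stage of consistency distillation from a multi-step teacher, followed by a second stage of adversarial fine-tuning with a discriminator that scores video latents directly, so no VAE decoding is required. The sketch below is a minimal PyTorch-style illustration of what such a two-stage schedule could look like; all module names, shapes, loss weights, and hyperparameters are assumptions for illustration, not the authors' actual OSV implementation.

```python
# Minimal sketch of a two-stage schedule: (1) consistency distillation from a
# multi-step teacher, (2) adversarial fine-tuning with a discriminator that
# operates directly on video latents (no VAE decode). All shapes, modules, and
# hyperparameters here are illustrative assumptions, not the OSV implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, H, W = 2, 8, 4, 32, 32  # batch, frames, latent channels, latent height/width


class TinyVideoUNet(nn.Module):
    """Stand-in for a latent video denoiser (student or frozen teacher)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv3d(C, C, kernel_size=3, padding=1)

    def forward(self, z, t):
        # Broadcast the timestep as a trivial conditioning bias.
        return self.net(z) + t.view(-1, 1, 1, 1, 1)


class LatentVideoDiscriminator(nn.Module):
    """Scores realism directly in latent space, so no decoding to pixels is needed."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(C, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(32, 1, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(z).mean(dim=(1, 2, 3, 4))  # one logit per video


teacher, student = TinyVideoUNet().eval(), TinyVideoUNet()
disc = LatentVideoDiscriminator()
opt_g = torch.optim.AdamW(student.parameters(), lr=1e-5)
opt_d = torch.optim.AdamW(disc.parameters(), lr=1e-4)

def sample_batch():
    z_real = torch.randn(B, C, T, H, W)   # latents of real videos
    z_noisy = torch.randn(B, C, T, H, W)  # noised latents fed to the denoiser
    t = torch.rand(B)                     # timesteps in [0, 1]
    return z_real, z_noisy, t

# ---- Stage 1: consistency distillation (student mimics the teacher's target) ----
for _ in range(2):  # a few toy iterations
    _, z_noisy, t = sample_batch()
    with torch.no_grad():
        target = teacher(z_noisy, t)      # teacher's denoised estimate
    loss_cd = F.mse_loss(student(z_noisy, t), target)
    opt_g.zero_grad(); loss_cd.backward(); opt_g.step()

# ---- Stage 2: adversarial fine-tuning with the latent-space discriminator ----
for _ in range(2):
    z_real, z_noisy, t = sample_batch()
    fake = student(z_noisy, t)

    # Discriminator step: real latents vs. detached one-step student outputs.
    d_loss = (F.softplus(-disc(z_real)) + F.softplus(disc(fake.detach()))).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator, keep a distillation anchor term.
    with torch.no_grad():
        target = teacher(z_noisy, t)
    g_loss = F.softplus(-disc(fake)).mean() + 0.1 * F.mse_loss(fake, target)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Because the discriminator consumes latents of shape (B, C, T, H, W) directly, the expensive per-frame VAE decode is skipped during training, which is the efficiency point the abstract attributes to the proposed discriminator design.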
Related papers
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method brings about significant improvement in temporal consistency and image fidelity.
arXiv Detail & Related papers (2024-10-06T05:46:17Z) - QVD: Post-training Quantization for Video Diffusion Models [33.13078954859106]
Post-training quantization (PTQ) is an effective technique to reduce memory footprint and improve computational efficiency.
We introduce the first PTQ strategy tailored for video diffusion models, dubbed QVD.
We achieve near-lossless performance under W8A8 quantization, outperforming current methods by 205.12 in FVD.
arXiv Detail & Related papers (2024-07-16T10:47:27Z) - Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation [134.22372190926362]
Image diffusion distillation achieves high-fidelity generation with very few sampling steps.
Applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to limited visual quality in public video datasets.
Our study aims to improve video diffusion distillation while enhancing frame appearance using abundant high-quality image data.
arXiv Detail & Related papers (2024-06-11T02:09:46Z) - SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z) - One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores.
arXiv Detail & Related papers (2023-12-12T07:28:40Z) - LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z) - Boost Video Frame Interpolation via Motion Adaptation [73.42573856943923]
Video frame interpolation (VFI) is a challenging task that aims to generate intermediate frames between two consecutive frames in a video.
Existing learning-based VFI methods have achieved great success, but they still suffer from limited generalization ability.
We propose a novel optimization-based VFI method that can adapt to unseen motions at test time.
arXiv Detail & Related papers (2023-06-24T10:44:02Z)