OSV: One Step is Enough for High-Quality Image to Video Generation
- URL: http://arxiv.org/abs/2409.11367v1
- Date: Tue, 17 Sep 2024 17:16:37 GMT
- Title: OSV: One Step is Enough for High-Quality Image to Video Generation
- Authors: Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang
- Abstract summary: We introduce a two-stage training framework that effectively combines consistency distillation and GAN training.
We also propose a novel video discriminator design, which eliminates the need for decoding the video latents.
Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement.
- Score: 29.77646091911169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. While efforts have been made to accelerate video diffusion by reducing inference steps (through techniques like consistency distillation) and by GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15; lower is better) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79) and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).
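The two-stage recipe and the latent-space discriminator can be pictured with a short sketch. This is a minimal illustration based only on the abstract, not the authors' code: the `student`, `teacher`, and `disc` modules, the tensor shapes, and the loss choices are all assumptions made for clarity.
```python
import torch
import torch.nn.functional as F

# Minimal stand-ins for the denoiser and the latent-space video critic.
# Real models would be large UNet/transformer backbones; shapes are illustrative.
student = torch.nn.Conv3d(4, 4, 3, padding=1)   # one-step generator being distilled
teacher = torch.nn.Conv3d(4, 4, 3, padding=1)   # frozen multi-step teacher
disc = torch.nn.Sequential(                     # scores video latents directly,
    torch.nn.Conv3d(4, 8, 3, padding=1),        # so no VAE decoding is needed
    torch.nn.SiLU(),
    torch.nn.AdaptiveAvgPool3d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 1),
)
opt_g = torch.optim.Adam(student.parameters(), lr=1e-5)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-5)

latents = torch.randn(2, 4, 8, 32, 32)          # (batch, ch, frames, h, w)
noise = torch.randn_like(latents)

# Stage 1: consistency distillation: pull the student's one-step
# prediction toward the teacher's target for the same noise.
with torch.no_grad():
    target = teacher(noise)
loss_cd = F.mse_loss(student(noise), target)
opt_g.zero_grad(); loss_cd.backward(); opt_g.step()

# Stage 2: adversarial fine-tuning with the latent-space discriminator.
fake = student(noise)
loss_d = F.softplus(disc(fake.detach())).mean() + F.softplus(-disc(latents)).mean()
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

loss_g = F.softplus(-disc(fake)).mean()         # non-saturating GAN loss
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```
In the paper's actual setup the teacher would be queried along a solver trajectory and the GAN stage would initialize from the stage-1 checkpoint; the point of the sketch is only that the discriminator scores latents directly, which is what removes the VAE decoding cost.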
Related papers
- Diffusion Adversarial Post-Training for One-Step Video Generation [26.14991703029242]
We propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation.
Our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
arXiv Detail & Related papers (2025-01-14T18:51:48Z)
- DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization [50.30051934609654]
We introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation.
Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS).
One-step distillation accelerates the teacher model's diffusion sampling by up to 278.6 times, enabling near real-time generation.
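As a rough picture of how variational score distillation and consistency distillation can be combined into one objective, consider the sketch below. It is a schematic under assumed interfaces: `generator`, `score_real`, `score_fake`, and `consistency_target` are placeholders, not DOLLAR's implementation.
```python
import torch
import torch.nn.functional as F

def distill_step(generator, score_real, score_fake, consistency_target,
                 noise, lambda_vsd=1.0, lambda_cd=1.0):
    """One few-step-distillation update mixing a VSD term and a CD term."""
    x = generator(noise)

    # VSD term: push samples along the difference between the
    # generator-distribution (fake) score and the teacher (real) score.
    with torch.no_grad():
        grad = score_fake(x) - score_real(x)
    loss_vsd = (x * grad).sum() / x.shape[0]  # surrogate whose gradient w.r.t. x is `grad`

    # CD term: match a precomputed consistency target for the same noise.
    with torch.no_grad():
        target = consistency_target(noise)
    loss_cd = F.mse_loss(x, target)

    return lambda_vsd * loss_vsd + lambda_cd * loss_cd
```
In practice the fake-score network is itself trained online on generator samples; the sketch only shows how the two distillation signals add up.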
arXiv Detail & Related papers (2024-12-20T09:07:36Z)
- Real-time One-Step Diffusion-based Expressive Portrait Videos Generation [85.07446744308247]
We introduce OSA-LCM (One-Step Avatar Latent Consistency Model), paving the way for real-time diffusion-based avatars.
Our method achieves comparable video quality to existing methods but requires only one sampling step, making it more than 10x faster.
arXiv Detail & Related papers (2024-12-18T03:42:42Z)
- Accelerating Video Diffusion Models via Distribution Matching [26.475459912686986]
This work introduces a novel framework for diffusion distillation and distribution matching.
Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator.
By leveraging a combination of video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames.
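One way a 2D score-matching loss can be applied to video is to fold the time axis into the batch so an image-level score model scores each frame. The sketch below encodes that reading; both score functions are assumed stand-ins, not the paper's networks.
```python
import torch

def score_matching_grad_2d(score_real_2d, score_fake_2d, video):
    """Frame-wise 2D distribution-matching gradient for a video batch.

    `video` is (B, C, T, H, W); the time axis is folded into the batch so a
    2D image score model can be applied per frame.
    """
    b, c, t, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    with torch.no_grad():
        grad = score_fake_2d(frames) - score_real_2d(frames)
    return grad.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
```
The returned tensor can be treated as a custom gradient on the generator's samples and combined with the video GAN loss mentioned in the summary.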
arXiv Detail & Related papers (2024-12-08T11:36:32Z)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method brings about significant improvement in temporal consistency and image fidelity.
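The interpolation idea lends itself to a very small sketch: during the first few denoising steps, the sampler's predicted clean latent is blended with the guiding model's prediction, after which sampling proceeds unguided. The function name, blend weight, and warmup schedule below are illustrative assumptions.
```python
import torch

def guided_denoise_step(x0_sample: torch.Tensor, x0_guide: torch.Tensor,
                        step: int, warmup_steps: int = 5,
                        alpha: float = 0.5) -> torch.Tensor:
    """Blend the guiding model's denoised sample into the sampling model's
    prediction during early steps only."""
    if step < warmup_steps:
        return alpha * x0_guide + (1.0 - alpha) * x0_sample
    return x0_sample
```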
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
- Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation [134.22372190926362]
Image diffusion distillation achieves high-fidelity generation with very few sampling steps.
Applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to limited visual quality in public video datasets.
Our study aims to improve video diffusion distillation while improving frame appearance using abundant high-quality image data.
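One reading of "disentangled motion-appearance distillation" is: keep the distillation loss on the full video clip for motion, and supervise frame appearance with a critic trained on high-quality still images. The sketch below encodes that reading; the module names and loss weighting are assumptions, not the paper's code.
```python
import torch
import torch.nn.functional as F

def motion_appearance_generator_loss(student_video, teacher_video,
                                     image_disc, lambda_app=0.1):
    """Motion via clip-level distillation; appearance via a frame-level
    critic trained on high-quality images (critic updates omitted here).

    student_video, teacher_video: (B, C, T, H, W) tensors.
    """
    # Motion/temporal term: match the teacher's prediction on the full clip.
    loss_motion = F.mse_loss(student_video, teacher_video)

    # Appearance term: the image critic scores individual student frames.
    b, c, t, h, w = student_video.shape
    frames = student_video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    loss_appearance = F.softplus(-image_disc(frames)).mean()

    return loss_motion + lambda_app * loss_appearance
```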
arXiv Detail & Related papers (2024-06-11T02:09:46Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)