OSV: One Step is Enough for High-Quality Image to Video Generation
- URL: http://arxiv.org/abs/2409.11367v1
- Date: Tue, 17 Sep 2024 17:16:37 GMT
- Title: OSV: One Step is Enough for High-Quality Image to Video Generation
- Authors: Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang
- Abstract summary: We introduce a two-stage training framework that effectively combines consistency distillation and GAN training.
We also propose a novel video discriminator design, which eliminates the need for decoding the video latents.
Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement.
- Score: 29.77646091911169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. While efforts have been made to accelerate video diffusion by reducing inference steps (through techniques like consistency distillation) and by GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15; lower is better) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79) and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).
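The two-stage recipe and the latent-space discriminator can be pictured with a short sketch. This is a minimal illustration based only on the abstract, not the authors' code: the `student`, `teacher`, and `disc` modules, the tensor shapes, and the loss choices are all assumptions made for clarity.
```python
import torch
import torch.nn.functional as F

# Minimal stand-ins for the denoiser and the latent-space video critic.
# Real models would be large UNet/transformer backbones; shapes are illustrative.
student = torch.nn.Conv3d(4, 4, 3, padding=1)   # one-step generator being distilled
teacher = torch.nn.Conv3d(4, 4, 3, padding=1)   # frozen multi-step teacher
disc = torch.nn.Sequential(                     # scores video latents directly,
    torch.nn.Conv3d(4, 8, 3, padding=1),        # so no VAE decoding is needed
    torch.nn.SiLU(),
    torch.nn.AdaptiveAvgPool3d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 1),
)
opt_g = torch.optim.Adam(student.parameters(), lr=1e-5)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-5)

latents = torch.randn(2, 4, 8, 32, 32)          # (batch, ch, frames, h, w)
noise = torch.randn_like(latents)

# Stage 1: consistency distillation: pull the student's one-step
# prediction toward the teacher's target for the same noise.
with torch.no_grad():
    target = teacher(noise)
loss_cd = F.mse_loss(student(noise), target)
opt_g.zero_grad(); loss_cd.backward(); opt_g.step()

# Stage 2: adversarial fine-tuning with the latent-space discriminator.
fake = student(noise)
loss_d = F.softplus(disc(fake.detach())).mean() + F.softplus(-disc(latents)).mean()
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

loss_g = F.softplus(-disc(fake)).mean()         # non-saturating GAN loss
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```
In the paper's actual setup the teacher would be queried along a solver trajectory and the GAN stage would initialize from the stage-1 checkpoint; the point of the sketch is only that the discriminator scores latents directly, which is what removes the VAE decoding cost.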
Related papers
- Diffusion Adversarial Post-Training for One-Step Video Generation [26.14991703029242]
We propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation.
Our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
arXiv Detail & Related papers (2025-01-14T18:51:48Z)
- DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization [50.30051934609654]
We introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation.
Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS).
One-step distillation accelerates the teacher model's diffusion sampling by up to 278.6 times, enabling near real-time generation.
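As a rough picture of how variational score distillation and consistency distillation can be combined into one objective, consider the sketch below. It is a schematic under assumed interfaces: `generator`, `score_real`, `score_fake`, and `consistency_target` are placeholders, not DOLLAR's implementation.
```python
import torch
import torch.nn.functional as F

def distill_step(generator, score_real, score_fake, consistency_target,
                 noise, lambda_vsd=1.0, lambda_cd=1.0):
    """One few-step-distillation update mixing a VSD term and a CD term."""
    x = generator(noise)

    # VSD term: push samples along the difference between the
    # generator-distribution (fake) score and the teacher (real) score.
    with torch.no_grad():
        grad = score_fake(x) - score_real(x)
    loss_vsd = (x * grad).sum() / x.shape[0]  # surrogate whose gradient w.r.t. x is `grad`

    # CD term: match a precomputed consistency target for the same noise.
    with torch.no_grad():
        target = consistency_target(noise)
    loss_cd = F.mse_loss(x, target)

    return lambda_vsd * loss_vsd + lambda_cd * loss_cd
```
In practice the fake-score network is itself trained online on generator samples; the sketch only shows how the two distillation signals add up.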
arXiv Detail & Related papers (2024-12-20T09:07:36Z)
- Real-time One-Step Diffusion-based Expressive Portrait Videos Generation [85.07446744308247]
We introduce OSA-LCM (One-Step Avatar Latent Consistency Model), paving the way for real-time diffusion-based avatars.
Our method achieves comparable video quality to existing methods but requires only one sampling step, making it more than 10x faster.
arXiv Detail & Related papers (2024-12-18T03:42:42Z)
- Accelerating Video Diffusion Models via Distribution Matching [26.475459912686986]
This work introduces a novel framework for diffusion distillation and distribution matching.
Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator.
By leveraging a combination of video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames.
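One way a 2D score-matching loss can be applied to video is to fold the time axis into the batch so an image-level score model scores each frame. The sketch below encodes that reading; both score functions are assumed stand-ins, not the paper's networks.
```python
import torch

def score_matching_grad_2d(score_real_2d, score_fake_2d, video):
    """Frame-wise 2D distribution-matching gradient for a video batch.

    `video` is (B, C, T, H, W); the time axis is folded into the batch so a
    2D image score model can be applied per frame.
    """
    b, c, t, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    with torch.no_grad():
        grad = score_fake_2d(frames) - score_real_2d(frames)
    return grad.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
```
The returned tensor can be treated as a custom gradient on the generator's samples and combined with the video GAN loss mentioned in the summary.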
arXiv Detail & Related papers (2024-12-08T11:36:32Z)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method brings about significant improvement in temporal consistency and image fidelity.
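The interpolation idea lends itself to a very small sketch: during the first few denoising steps, the sampler's predicted clean latent is blended with the guiding model's prediction, after which sampling proceeds unguided. The function name, blend weight, and warmup schedule below are illustrative assumptions.
```python
import torch

def guided_denoise_step(x0_sample: torch.Tensor, x0_guide: torch.Tensor,
                        step: int, warmup_steps: int = 5,
                        alpha: float = 0.5) -> torch.Tensor:
    """Blend the guiding model's denoised sample into the sampling model's
    prediction during early steps only."""
    if step < warmup_steps:
        return alpha * x0_guide + (1.0 - alpha) * x0_sample
    return x0_sample
```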
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
- Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation [134.22372190926362]
Image diffusion distillation achieves high-fidelity generation with very few sampling steps.
Applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to limited visual quality in public video datasets.
Our study aims to improve video diffusion distillation while improving frame appearance using abundant high-quality image data.
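One reading of "disentangled motion-appearance distillation" is: keep the distillation loss on the full video clip for motion, and supervise frame appearance with a critic trained on high-quality still images. The sketch below encodes that reading; the module names and loss weighting are assumptions, not the paper's code.
```python
import torch
import torch.nn.functional as F

def motion_appearance_generator_loss(student_video, teacher_video,
                                     image_disc, lambda_app=0.1):
    """Motion via clip-level distillation; appearance via a frame-level
    critic trained on high-quality images (critic updates omitted here).

    student_video, teacher_video: (B, C, T, H, W) tensors.
    """
    # Motion/temporal term: match the teacher's prediction on the full clip.
    loss_motion = F.mse_loss(student_video, teacher_video)

    # Appearance term: the image critic scores individual student frames.
    b, c, t, h, w = student_video.shape
    frames = student_video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    loss_appearance = F.softplus(-image_disc(frames)).mean()

    return loss_motion + lambda_app * loss_appearance
```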
arXiv Detail & Related papers (2024-06-11T02:09:46Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)