Diffusion Adversarial Post-Training for One-Step Video Generation
- URL: http://arxiv.org/abs/2501.08316v1
- Date: Tue, 14 Jan 2025 18:51:48 GMT
- Title: Diffusion Adversarial Post-Training for One-Step Video Generation
- Authors: Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang
- Abstract summary: We propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation.
Our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
- Score: 26.14991703029242
- Abstract: Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarially post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
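As a rough illustration of the objective described in the abstract, the sketch below shows one discriminator update that combines a GAN loss with an approximated R1 term penalizing the discriminator's sensitivity to small Gaussian perturbations of real inputs, which avoids the higher-order gradients of the exact R1 penalty. The generator G, discriminator D, perturbation scale, and loss weight are illustrative placeholders, not the paper's exact configuration.
```python
# Hedged sketch: one discriminator update for adversarial post-training.
# G, D, sigma, and lambda_r1 are placeholders, not the paper's settings.
import torch
import torch.nn.functional as F

def discriminator_step(G, D, real_videos, noise, text_emb,
                       sigma=0.01, lambda_r1=100.0):
    with torch.no_grad():
        fake_videos = G(noise, text_emb)          # one-step generation

    real_logits = D(real_videos, text_emb)
    fake_logits = D(fake_videos, text_emb)

    # Non-saturating GAN loss (hinge/logistic variants are common alternatives).
    adv_loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()

    # Approximated R1: compare scores on real vs. slightly perturbed real
    # samples instead of computing the exact gradient penalty on real data.
    perturbed = real_videos + sigma * torch.randn_like(real_videos)
    perturbed_logits = D(perturbed, text_emb)
    approx_r1 = ((real_logits - perturbed_logits) ** 2).mean()

    return adv_loss + lambda_r1 * approx_r1
```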
Related papers
- Real-time One-Step Diffusion-based Expressive Portrait Videos Generation [85.07446744308247]
We introduce OSA-LCM (One-Step Avatar Latent Consistency Model), paving the way for real-time diffusion-based avatars.
Our method achieves comparable video quality to existing methods but requires only one sampling step, making it more than 10x faster.
arXiv Detail & Related papers (2024-12-18T03:42:42Z)
- Accelerating Video Diffusion Models via Distribution Matching [26.475459912686986]
This work introduces a novel framework for diffusion distillation and distribution matching.
Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator.
By leveraging a combination of video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames.
arXiv Detail & Related papers (2024-12-08T11:36:32Z)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method brings about significant improvement in temporal consistency and image fidelity.
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
- OSV: One Step is Enough for High-Quality Image to Video Generation [29.77646091911169]
We introduce a two-stage training framework that effectively combines consistency distillation and GAN training.
We also propose a novel video discriminator design, which eliminates the need for decoding the video latents.
Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement.
arXiv Detail & Related papers (2024-09-17T17:16:37Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- Upsample Guidance: Scale Up Diffusion Models without Training [0.0]
We introduce upsample guidance, a technique that adapts a pretrained diffusion model to generate higher-resolution images.
Remarkably, this technique requires no additional training and does not rely on external models.
We demonstrate that upsample guidance can be applied to various models, such as pixel-space, latent-space, and video diffusion models.
arXiv Detail & Related papers (2024-04-02T07:49:08Z)
- Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation [112.08287900261898]
This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation.
Our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters.
Experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
arXiv Detail & Related papers (2024-02-16T07:48:35Z)
- One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores.
arXiv Detail & Related papers (2023-12-12T07:28:40Z)
- AdaDiff: Adaptive Step Selection for Fast Diffusion Models [82.78899138400435]
We introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies.
AdaDiff is optimized with a policy gradient method to maximize a carefully designed reward function (a sketch of this step-selection idea follows the list below).
Experiments on three image generation and two video generation benchmarks demonstrate that our approach achieves visual quality similar to the baseline.
arXiv Detail & Related papers (2023-11-24T11:20:38Z)
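As a rough, hedged sketch of the instance-adaptive step selection summarized in the last entry above, the snippet below trains a small policy network with REINFORCE to pick a per-prompt number of denoising steps, rewarding visual quality while penalizing step count. The policy_net, sample_fn, quality_score, and penalty weight are hypothetical placeholders, not the paper's exact reward or architecture.
```python
# Hedged sketch: instance-adaptive denoising-step selection trained with
# REINFORCE. policy_net, sample_fn, quality_score, and step_penalty are
# illustrative placeholders only.
import torch

def policy_gradient_update(policy_net, optimizer, prompt_embs,
                           sample_fn, quality_score, step_penalty=0.01):
    logits = policy_net(prompt_embs)                   # [batch, max_steps]
    dist = torch.distributions.Categorical(logits=logits)
    steps = dist.sample() + 1                          # at least one step

    rewards = []
    with torch.no_grad():
        for emb, n in zip(prompt_embs, steps.tolist()):
            sample = sample_fn(emb, num_steps=n)       # run the sampler for n steps
            rewards.append(float(quality_score(sample)) - step_penalty * n)
    rewards = torch.tensor(rewards, device=logits.device)

    # REINFORCE: maximize expected reward over the sampled step counts.
    loss = -(dist.log_prob(steps - 1) * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```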