DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
- URL: http://arxiv.org/abs/2412.15689v1
- Date: Fri, 20 Dec 2024 09:07:36 GMT
- Title: DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
- Authors: Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, Yuchen Liu
- Abstract summary: We introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation. Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). One-step distillation accelerates the teacher model's diffusion sampling by up to 278.6 times, enabling near real-time generation.
- Score: 50.30051934609654
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion probabilistic models have shown significant progress in video generation; however, their computational efficiency is limited by the large number of sampling steps required. Reducing sampling steps often compromises video quality or generation diversity. In this work, we introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation, maintaining both high quality and diversity. We also propose a latent reward model fine-tuning approach to further enhance video generation performance according to any specified reward metric. This approach reduces memory usage and does not require the reward to be differentiable. Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). The distilled student model achieves a score of 82.57 on VBench, surpassing the teacher model as well as baseline models Gen-3, T2V-Turbo, and Kling. One-step distillation accelerates the teacher model's diffusion sampling by up to 278.6 times, enabling near real-time generation. Human evaluations further validate the superior performance of our 4-step student models compared to the teacher model using 50-step DDIM sampling.
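The abstract combines two distillation signals: a consistency-style term that lets the student jump across many teacher steps, and a variational score distillation (VSD)-style term that keeps the student's outputs on the teacher's distribution. The sketch below illustrates, on toy latents, how such a combined objective could be wired up. It is a minimal sketch, not the authors' implementation: the network definitions, noising schedule, step size, and loss weighting are all assumptions, and the latent reward fine-tuning stage is omitted.

```python
# Minimal sketch (not the paper's code): consistency-style distillation combined
# with a VSD-style term on toy "latents". All names and constants are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Toy stand-in for a video diffusion backbone acting on flattened latents."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

dim = 64
teacher = TinyDenoiser(dim).eval()      # frozen pre-trained teacher (assumed given)
student = TinyDenoiser(dim)             # few-step student being distilled
fake_score = TinyDenoiser(dim).eval()   # auxiliary model used by the VSD-style term
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def add_noise(x0, t, noise):
    # simple linear-interpolation noising schedule (illustrative)
    return (1 - t)[:, None] * x0 + t[:, None] * noise

for step in range(10):
    x0 = torch.randn(8, dim)            # stand-in for clean video latents
    t = torch.rand(8).clamp(0.02, 0.98)
    xt = add_noise(x0, t, torch.randn_like(x0))

    # (1) Consistency-distillation term: the student's output at t should match
    # its stop-gradient output at an adjacent, less-noisy point reached by one
    # small teacher step.
    with torch.no_grad():
        x_adj = xt - 0.05 * teacher(xt, t)
        target = student(x_adj, (t - 0.05).clamp(min=0.0))
    loss_cd = F.mse_loss(student(xt, t), target)

    # (2) VSD-style term: nudge the student's one-step prediction using the
    # difference between the teacher's and the auxiliary model's predictions.
    # (In practice the auxiliary model is updated alternately on student
    # samples; that update is omitted here.)
    x_gen = student(xt, t)
    xg_noisy = add_noise(x_gen.detach(), t, torch.randn_like(x_gen))
    with torch.no_grad():
        grad = teacher(xg_noisy, t) - fake_score(xg_noisy, t)
    loss_vsd = (grad * x_gen).mean()    # surrogate whose gradient w.r.t. x_gen is proportional to `grad`

    loss = loss_cd + 0.5 * loss_vsd     # relative weighting is an assumption
    opt.zero_grad()
    loss.backward()
    opt.step()
```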
Related papers
- AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo, which reduces the number of inference steps to accelerate video diffusion models using a synthetic dataset.
Our model achieves an 8.5x improvement in generation speed compared to the teacher model.
Compared to previous acceleration methods, our approach generates videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z)
- Learning Few-Step Diffusion Models by Trajectory Distribution Matching [18.229753357571116]
Trajectory Distribution Matching (TDM) is a unified distillation paradigm that combines the strengths of distribution and trajectory matching.
We develop a sampling-steps-aware objective that decouples learning targets across different steps, enabling more adjustable sampling.
Our model, TDM, outperforms existing methods on various backbones, delivering superior quality and significantly reduced training costs.
arXiv Detail & Related papers (2025-03-09T15:53:49Z)
- From Slow Bidirectional to Fast Causal Video Generators [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer into a causal transformer that generates frames on the fly. Our model supports fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- Accelerating Video Diffusion Models via Distribution Matching [26.475459912686986]
This work introduces a novel framework for diffusion distillation and distribution matching. Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator. By leveraging a combination of a video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames.
arXiv Detail & Related papers (2024-12-08T11:36:32Z)
- OSV: One Step is Enough for High-Quality Image to Video Generation [29.77646091911169]
We introduce a two-stage training framework that effectively combines consistency distillation and GAN training.
We also propose a novel video discriminator design, which eliminates the need for decoding the video latents.
Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement.
arXiv Detail & Related papers (2024-09-17T17:16:37Z)
- Diffusion Models Are Innate One-Step Generators [2.3359837623080613]
Diffusion Models (DMs) can generate remarkably high-quality results.
DMs' layers are differentially activated at different time steps, leading to an inherent capability to generate images in a single step.
Our method achieves the SOTA results on CIFAR-10, AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and ImageNet 64x64 (FID 1.16) with great efficiency.
arXiv Detail & Related papers (2024-05-31T11:14:12Z)
- T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback [111.40967379458752]
We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation process of a pre-trained T2V model.
Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika.
arXiv Detail & Related papers (2024-05-29T04:26:17Z)
- Consistency Models [89.68380014789861]
We propose a new family of models that generate high quality samples by directly mapping noise to data.
They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality.
They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training.
arXiv Detail & Related papers (2023-03-02T18:30:16Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the distillation process.
Most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models (see the sketch below).
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
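The last entry motivates weighting teachers per training example rather than using one fixed weight. The toy sketch below shows one way such per-example weighting could look; the confidence-based weighting heuristic, model sizes, and temperature are illustrative assumptions, and the cited paper instead learns the teacher selection with reinforcement learning, which is not reproduced here.

```python
# Toy sketch (not the cited paper's method): distilling from several teachers
# with per-example weights instead of a single fixed weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim = 10, 32
teachers = [nn.Linear(dim, num_classes).eval() for _ in range(3)]  # frozen toy teachers
student = nn.Linear(dim, num_classes)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
T = 2.0  # distillation temperature (assumption)

x = torch.randn(16, dim)
y = torch.randint(0, num_classes, (16,))

with torch.no_grad():
    t_logits = torch.stack([t(x) for t in teachers])               # [T, B, C]
    # heuristic: weight each teacher by its per-example confidence in the true label
    conf = torch.stack([F.log_softmax(l, dim=-1)[torch.arange(len(y)), y]
                        for l in t_logits])                        # [T, B]
    w = F.softmax(conf, dim=0)                                     # per-example teacher weights

s_log_probs = F.log_softmax(student(x) / T, dim=-1)                # [B, C]
t_probs = F.softmax(t_logits / T, dim=-1)                          # [T, B, C]
# KL divergence from each teacher, combined with the per-example weights
kl = (t_probs * (t_probs.clamp_min(1e-8).log() - s_log_probs)).sum(-1)  # [T, B]
loss = (w * kl).sum(0).mean() * (T * T) + F.cross_entropy(student(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```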