VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
- URL: http://arxiv.org/abs/2512.06802v2
- Date: Mon, 15 Dec 2025 03:40:01 GMT
- Title: VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
- Authors: Yutong Wang, Haiyu Zhang, Tianfan Xue, Yu Qiao, Yaohui Wang, Chang Xu, Xinyuan Chen
- Abstract summary: We propose an efficient unified video creation model, named VDOT. We employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering.
- Score: 70.02065520468726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of relying solely on Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT outperforms or matches baselines that use 100 denoising steps.
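The key idea of replacing the KL term in DMD-style distillation with a computational OT distance between real and fake score distributions can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch example, not the authors' implementation: the entropic (Sinkhorn) solver, the function name `sinkhorn_distance`, the hyperparameters `eps` and `n_iters`, and treating batches of score vectors as empirical distributions are all assumptions made purely for illustration.

```python
# Hedged sketch (not the VDOT code): an entropic-regularized OT (Sinkhorn) distance
# between two batches of score vectors, usable as a stand-in for the KL term in a
# DMD-style distillation loss.
import torch

def sinkhorn_distance(x: torch.Tensor, y: torch.Tensor,
                      eps: float = 0.05, n_iters: int = 100) -> torch.Tensor:
    """Entropic OT distance between empirical distributions {x_i} and {y_j}.

    x: (n, d) samples from the "real" score distribution
    y: (m, d) samples from the "fake" (student) score distribution
    """
    n, m = x.shape[0], y.shape[0]
    # Squared Euclidean cost matrix C_ij = ||x_i - y_j||^2 (the geometric constraint)
    cost = torch.cdist(x, y, p=2) ** 2
    # Uniform marginals over the two batches
    log_mu = torch.full((n,), 1.0 / n, device=x.device).log()
    log_nu = torch.full((m,), 1.0 / m, device=x.device).log()
    # Log-domain Sinkhorn iterations for numerical stability
    f = torch.zeros(n, device=x.device)
    g = torch.zeros(m, device=x.device)
    for _ in range(n_iters):
        f = eps * (log_mu - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_nu - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    # Transport plan and resulting OT cost <P, C>
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps)
    return torch.sum(plan * cost)

# Illustrative usage: penalize the discrepancy between real- and fake-score batches.
real_scores = torch.randn(64, 128)                       # e.g., real-score model outputs
fake_scores = torch.randn(64, 128, requires_grad=True)   # e.g., fake-score model outputs
loss = sinkhorn_distance(real_scores, fake_scores)
loss.backward()
```

In a DMD-style setup, a loss of this form could replace the KL-based discrepancy between the real- and fake-score models; because the objective is built on a pairwise geometric cost matrix rather than a density ratio, it remains informative even when the two distributions have little overlap, which is the intuition behind the zero-forcing and gradient-collapse argument in the abstract.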
Related papers
- CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving [26.379817613036597]
CVD-STORM is a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE). Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics.
arXiv Detail & Related papers (2025-10-09T08:41:58Z)
- SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment [76.60024640625478]
Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps. We propose a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our method maintains high-quality video generation while substantially reducing the number of inference steps.
arXiv Detail & Related papers (2025-08-08T07:26:34Z)
- Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation [53.877572078307935]
Distilled video generation models offer fast and efficient synthesis but struggle with motion customization when guided by reference videos. We propose MotionEcho, a training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing.
arXiv Detail & Related papers (2025-06-24T06:20:15Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference.
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
- DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion [4.863177884263436]
We present a training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity.
arXiv Detail & Related papers (2025-06-02T09:12:41Z)
- AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo to reduce the inference steps for accelerating video diffusion models with a synthetic dataset. Our model achieves 8.5x improvements in generation speed compared to the teacher model. Compared to previous acceleration methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [48.35054927704544]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- Accelerating Video Diffusion Models via Distribution Matching [26.475459912686986]
This work introduces a novel framework for diffusion distillation and distribution matching. Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator. By leveraging a combination of video GAN loss and a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames.
arXiv Detail & Related papers (2024-12-08T11:36:32Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality for synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- AnimateLCM: Computation-Efficient Personalized Style Video Generation without Personalized Video Data [45.20627288830823]
It reduces the necessary generation time of similarly sized video diffusion models from 25 seconds to around 1 second.
The method's effectiveness lies in its dual-level decoupling learning approach.
arXiv Detail & Related papers (2024-02-01T16:58:11Z)