Discriminator-Free Direct Preference Optimization for Video Diffusion
- URL: http://arxiv.org/abs/2504.08542v1
- Date: Fri, 11 Apr 2025 13:55:48 GMT
- Title: Discriminator-Free Direct Preference Optimization for Video Diffusion
- Authors: Haoran Cheng, Qide Dong, Liang Peng, Zhizhou Sha, Weiguo Feng, Jinghui Xie, Zhao Song, Shilei Wen, Xiaofei He, Boxi Wu
- Abstract summary: We propose a discriminator-free video DPO framework that uses original real videos as win cases and edited versions as lose cases. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions.
- Score: 25.304451979598863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs; (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts like flickering or motion incoherence. To address these, we propose a discriminator-free video DPO framework that: (1) Uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; (2) Trains video diffusion models to distinguish and avoid artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.
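The recipe described above (real clips as win cases, simple edits of the same clips as lose cases, then DPO-style fine-tuning of the video diffusion model) can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the edit operations mirror the examples in the abstract (reversal, frame shuffling, noise corruption), while the loss uses the commonly cited Diffusion-DPO form on noise-prediction errors against a frozen reference copy of the model; the exact objective, the beta scale, and the model interface (alphas_cumprod, net(x_t, t)) are assumptions.

```python
import torch
import torch.nn.functional as F

def make_lose_clip(video: torch.Tensor, mode: str = "reverse") -> torch.Tensor:
    """Turn a real 'win' clip into a 'lose' clip via a simple edit.
    video: (frames, channels, height, width)."""
    if mode == "reverse":                              # temporal reversal
        return video.flip(0)
    if mode == "shuffle":                              # random frame order
        return video[torch.randperm(video.shape[0])]
    if mode == "noise":                                # noise corruption (strength is arbitrary here)
        return video + 0.1 * torch.randn_like(video)
    raise ValueError(f"unknown edit mode: {mode}")

def dpo_diffusion_loss(model, ref_model, x_win, x_lose, t, beta=500.0):
    """Diffusion-DPO-style objective (assumed form): push the trainable model's
    relative denoising error to be lower on win clips than on lose clips.
    x_win, x_lose: (batch, frames, channels, height, width); t: (batch,) timesteps."""
    noise = torch.randn_like(x_win)

    def eps_error(net, x0):
        # Forward diffusion q(x_t | x_0); the noise-schedule access is an assumed interface.
        a = net.alphas_cumprod[t].view(-1, 1, 1, 1, 1)
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
        return ((net(x_t, t) - noise) ** 2).mean(dim=(1, 2, 3, 4))

    err_w, err_l = eps_error(model, x_win), eps_error(model, x_lose)
    with torch.no_grad():                              # frozen reference model
        ref_w, ref_l = eps_error(ref_model, x_win), eps_error(ref_model, x_lose)

    # Implicit preference signal: win clips should be easier to denoise than
    # lose clips, relative to the reference model.
    logits = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(logits).mean()
```

A training step would sample real clips, build lose clips with make_lose_clip, sample timesteps, and minimize this loss over the trainable copy; no generated videos or external discriminator are needed, which is the point made in the abstract above.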
Related papers
- Direct Motion Models for Assessing Generated Videos [38.04485796547767]
A current limitation of generative video models is that they generate plausible-looking frames, but poor motion.
Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion.
We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data (a rough sketch of this track-based idea appears after this list).
arXiv Detail & Related papers (2025-04-30T22:34:52Z)
- Learning from Streaming Video with Orthogonal Gradients [62.51504086522027]
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner.
This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch.
We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks.
arXiv Detail & Related papers (2025-04-02T17:59:57Z)
- AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo to reduce the number of inference steps needed by video diffusion models, accelerating them with a synthetic dataset.
Our model achieves an 8.5x improvement in generation speed compared to the teacher model.
Compared to previous acceleration methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z)
- VideoPure: Diffusion-based Adversarial Purification for Video Recognition [21.317424798634086]
We propose the first diffusion-based video purification framework to improve video recognition models' adversarial robustness: VideoPure.
We employ temporal DDIM inversion to transform the input distribution into a temporally consistent and trajectory-defined distribution, covering adversarial noise while preserving more video structure.
We investigate the defense performance of our method against black-box, gray-box, and adaptive attacks on benchmark datasets and models.
arXiv Detail & Related papers (2025-01-25T00:24:51Z)
- OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization [30.6130504613716]
We introduce OnlineVPO, a preference learning approach tailored specifically for video diffusion models.
By employing the video reward model to offer concise video feedback on the fly, OnlineVPO offers effective and efficient preference guidance.
arXiv Detail & Related papers (2024-12-19T18:34:50Z)
- Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs.
Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware.
We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion models.
arXiv Detail & Related papers (2024-12-19T18:32:21Z)
- Real-time One-Step Diffusion-based Expressive Portrait Videos Generation [85.07446744308247]
We introduce OSA-LCM (One-Step Avatar Latent Consistency Model), paving the way for real-time diffusion-based avatars.
Our method achieves comparable video quality to existing methods but requires only one sampling step, making it more than 10x faster.
arXiv Detail & Related papers (2024-12-18T03:42:42Z)
- Video Summarization using Denoising Diffusion Probabilistic Model [21.4190413531697]
We introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective.
Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction.
Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability.
arXiv Detail & Related papers (2024-12-11T13:02:09Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- InstructVideo: Instructing Video Diffusion Models with Human Feedback [65.9590462317474]
We propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning.
InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing.
arXiv Detail & Related papers (2023-12-19T17:55:16Z)
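As a rough illustration of the point-track idea from "Direct Motion Models for Assessing Generated Videos" above, the sketch below compares real and generated videos by fitting Gaussians to simple track-displacement statistics and computing a Fréchet distance (the same functional form as FVD/FID, but on motion features). This is a minimal sketch under assumed choices: the tracker, the feature definition in track_motion_features, and the Gaussian/Fréchet comparison are illustrative assumptions, not that paper's actual metric.

```python
import numpy as np
from scipy.linalg import sqrtm

def track_motion_features(tracks: np.ndarray) -> np.ndarray:
    """Per-point motion statistics from point tracks of one video.

    tracks: (num_points, num_frames, 2) array of (x, y) positions from any
    off-the-shelf point tracker. Features from many videos can be concatenated
    before the comparison below.
    """
    disp = np.diff(tracks, axis=1)                       # frame-to-frame displacements
    feats = np.concatenate([disp.mean(axis=1),           # average motion per point
                            disp.std(axis=1)], axis=-1)  # motion variability per point
    return feats.reshape(-1, feats.shape[-1])

def frechet_track_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of real vs. generated track features."""
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):                         # drop numerical imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

In this form, temporal edits such as frame reversal or shuffling change the displacement statistics and therefore increase the distance, whereas purely per-frame appearance metrics would not register them.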
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.