VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
- URL: http://arxiv.org/abs/2412.14167v1
- Date: Wed, 18 Dec 2024 18:59:49 GMT
- Title: VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
- Authors: Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, Qifeng Chen
- Abstract summary: Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation.
We propose a VideoDPO pipeline by making several key adjustments.
Our experiments demonstrate substantial improvements in both visual quality and semantic alignment.
- Score: 48.36302380755874
- License:
- Abstract: Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment on pre-trained models. Although Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline by making several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected. Code and data will be shared at https://videodpo.github.io/.
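To make the abstract's pipeline concrete, below is a minimal, assumption-heavy sketch of how an OmniScore-style preference score, automatic pair collection, and score-based re-weighting could feed a DPO objective. The scorer interfaces (`quality_scorer`, `alignment_scorer`), the gap-based weight, and the log-probability form of the loss are illustrative placeholders inferred from the abstract, not the released VideoDPO implementation; for video diffusion models, log-likelihoods are typically replaced by a denoising-error surrogate as in Diffusion-DPO.

```python
import math

def omni_score(video, prompt, quality_scorer, alignment_scorer, alpha=0.5):
    """Hypothetical combined score over (i) visual quality and (ii) text-video
    semantic alignment; the real OmniScore may be defined differently."""
    return alpha * quality_scorer(video) + (1 - alpha) * alignment_scorer(video, prompt)

def build_weighted_pair(prompt, videos, quality_scorer, alignment_scorer):
    """Pick the best/worst generations for a prompt and weight the pair by the
    score gap; this gap-based weight only loosely mirrors the re-weighting idea."""
    def score(v):
        return omni_score(v, prompt, quality_scorer, alignment_scorer)
    ranked = sorted(videos, key=score)
    chosen, rejected = ranked[-1], ranked[0]
    weight = 1.0 + (score(chosen) - score(rejected))  # larger gaps count more
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "weight": weight}

def weighted_dpo_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, weight, beta=0.1):
    """Per-pair weighted DPO logistic loss: -w * log sigmoid(beta * margin)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return weight * math.log1p(math.exp(-margin))
```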
Related papers
- DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization [75.55167570591063]
We propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process.
DreamDPO reduces reliance on precise pointwise quality evaluations while enabling fine-grained controllability.
Experiments demonstrate that DreamDPO achieves competitive results and provides higher-quality, more controllable 3D content.
arXiv Detail & Related papers (2025-02-05T11:03:08Z)
- IPO: Iterative Preference Optimization for Text-to-Video Generation [15.763879468841818]
We introduce an Iterative Preference Optimization strategy to enhance generated video quality by incorporating human feedback.
IPO uses a critic model to judge generated videos, either through pairwise ranking as in Direct Preference Optimization or through point-wise scoring.
In addition, IPO instantiates the critic with a multi-modal large language model, enabling it to assign preference labels automatically without the need for retraining or relabeling.
arXiv Detail & Related papers (2025-02-04T08:14:34Z)
- Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search [23.3627657867351]
The alignment problem, in which the output of a diffusion model is steered according to some measure of the goodness of its content, has attracted considerable attention.
We propose diffusion latent beam search with a lookahead estimator, which can select better diffusion latents to maximize a given alignment reward.
We demonstrate that our method improves perceptual quality under the calibrated reward without any model parameter updates (a rough sketch of the search loop follows this entry).
arXiv Detail & Related papers (2025-01-31T16:09:30Z)
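As a loose illustration of the entry above, the sketch below shows a beam search over diffusion latents that scores candidates with a lookahead estimate of the final sample. `denoise_step`, `lookahead_to_x0`, and `reward` are hypothetical placeholders for the sampler step, the lookahead estimator, and the alignment reward, not the paper's actual interfaces.

```python
def latent_beam_search(init_latents, denoise_step, lookahead_to_x0, reward,
                       num_steps, beam_width=2, num_candidates=4):
    """Beam search over diffusion latents (illustrative pseudocode).

    init_latents: initial noise latents forming the starting beam.
    denoise_step(z, t): one stochastic reverse-diffusion step (hypothetical).
    lookahead_to_x0(z, t): cheap estimate of the final sample from z (hypothetical).
    reward(x0): scalar alignment reward for an estimated sample (hypothetical).
    """
    beam = list(init_latents)
    for t in range(num_steps):
        candidates = []
        for z in beam:
            # Expand each latent with several stochastic next-step candidates.
            for _ in range(num_candidates):
                z_next = denoise_step(z, t)
                # Score each candidate by looking ahead to an estimated clean sample.
                score = float(reward(lookahead_to_x0(z_next, t)))
                candidates.append((score, z_next))
        # Keep only the highest-reward latents for the next denoising step.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = [z for _, z in candidates[:beam_width]]
    return beam[0]  # the best final latent is decoded into the output
```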
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences.
With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way.
Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs.
Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware.
We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model.
arXiv Detail & Related papers (2024-12-19T18:32:21Z)
- Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization [14.50339880957898]
We aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques.
For preference data generation, we propose an iterative pairwise ranking mechanism that derives a preference ranking of completions from pairwise comparison signals (a small sketch of this idea follows this entry).
For training regularization, we observe that preference optimization tends to converge better when the LLM's predicted likelihood of preferred samples is slightly reduced.
arXiv Detail & Related papers (2024-11-07T23:03:11Z)
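The pairwise-ranking idea in the entry above can be illustrated with a toy aggregation of pairwise wins; `prefer` is a hypothetical stand-in for the paper's pairwise comparison signal, and this is not the authors' actual mechanism.

```python
def rank_completions(completions, prefer):
    """Derive a preference ranking from pairwise comparisons (toy illustration).

    completions: candidate responses for a single prompt.
    prefer(a, b): hypothetical oracle returning True if a is preferred over b.
    """
    wins = [0] * len(completions)
    for i, a in enumerate(completions):
        for j, b in enumerate(completions):
            if i != j and prefer(a, b):
                wins[i] += 1  # count how often each completion wins a comparison
    order = sorted(range(len(completions)), key=lambda k: wins[k], reverse=True)
    return [completions[k] for k in order]

# The extremes of the ranking can then serve as a chosen/rejected DPO pair:
# ranked = rank_completions(candidates, prefer=my_comparator)
# chosen, rejected = ranked[0], ranked[-1]
```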
- Scalable Ranked Preference Optimization for Text-to-Image Generation [76.16285931871948]
We investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training.
The preferences for paired images are generated using a pre-trained reward function, eliminating the need to involve humans in the annotation process.
We introduce RankDPO to enhance DPO-based methods using this ranking feedback (a minimal sketch of such reward-based pair construction follows this entry).
arXiv Detail & Related papers (2024-10-23T16:42:56Z)
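A minimal sketch of the data-collection idea above: candidates generated per prompt are scored by a pretrained reward model and turned into ranked preference pairs without human labels. `generate` and `reward_model` are hypothetical placeholders, not the paper's actual interfaces.

```python
def build_ranked_pairs(prompts, generate, reward_model, samples_per_prompt=4):
    """Construct synthetic, ranked DPO preference pairs (illustrative sketch).

    generate(prompt): hypothetical text-to-image sampler returning one image.
    reward_model(prompt, image): hypothetical pretrained reward returning a float.
    """
    pairs = []
    for prompt in prompts:
        # Sample several candidates and sort them by the pretrained reward.
        images = [generate(prompt) for _ in range(samples_per_prompt)]
        ranked = sorted(images, key=lambda img: reward_model(prompt, img),
                        reverse=True)
        # Every higher-ranked/lower-ranked combination yields one training pair.
        for hi in range(len(ranked)):
            for lo in range(hi + 1, len(ranked)):
                pairs.append({"prompt": prompt,
                              "chosen": ranked[hi],
                              "rejected": ranked[lo]})
    return pairs
```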
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order.
Our model sets a new state of the art in zero-shot performance on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)