VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
- URL: http://arxiv.org/abs/2412.14167v1
- Date: Wed, 18 Dec 2024 18:59:49 GMT
- Title: VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
- Authors: Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, Qifeng Chen
- Abstract summary: Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation. We propose a VideoDPO pipeline by making several key adjustments. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment.
- Score: 48.36302380755874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment on pre-trained models. Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation; we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline built on several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected. Code and data will be shared at https://videodpo.github.io/.
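The abstract does not spell out the training objective. As a rough illustration, a Diffusion-DPO-style loss with per-pair re-weighting might look like the sketch below, where the denoising errors are assumed precomputed and `pair_weight` (e.g., derived from each pair's OmniScore gap) is a hypothetical term, not the paper's exact formulation:

```python
import torch.nn.functional as F

def weighted_diffusion_dpo_loss(err_w_theta, err_w_ref, err_l_theta, err_l_ref,
                                pair_weight, beta=500.0):
    """Diffusion-DPO-style loss with a per-pair weight.

    Each `err_*` tensor holds per-sample denoising MSEs
    ||eps - eps_model(x_t, t, c)||^2 of shape (B,), for the preferred (w) and
    rejected (l) video under the trainable policy (theta) and a frozen
    reference model (ref). `pair_weight` (B,) is a hypothetical weight, e.g.
    derived from the OmniScore gap of the pair; `beta` is a guessed scale.
    """
    # How much better the policy denoises the winner than the reference does...
    diff_w = err_w_theta - err_w_ref
    # ...and likewise for the loser.
    diff_l = err_l_theta - err_l_ref
    # Logistic DPO loss on the difference: push the policy to fit winners
    # better (relative to the reference) than losers.
    loss = -F.logsigmoid(-beta * (diff_w - diff_l))
    # Score-gap-based re-weighting of preference pairs.
    return (pair_weight * loss).mean()
```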
Related papers
- VPO: Aligning Text-to-Video Generation Models with Prompt Optimization [80.86205966195593]
Video generation models are typically trained on text-to-video pairs with highly detailed and carefully crafted descriptions.
We introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness.
Our experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods.
arXiv Detail & Related papers (2025-03-26T12:28:20Z)
- DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization [75.55167570591063]
We propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process.
DreamDPO reduces reliance on precise pointwise quality evaluations while enabling fine-grained controllability.
Experiments demonstrate that DreamDPO achieves competitive results and provides higher-quality and more controllable 3D content.
arXiv Detail & Related papers (2025-02-05T11:03:08Z)
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences (a sketch follows this entry).
With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way.
Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
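The abstract above leaves PPD's multi-reward objective unspecified; the sketch below is one speculative reading, in which a few-shot user embedding mixes several fixed reward scores into a personalized reward (the class name, architecture, and softmax mixing are all assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class PersonalizedReward(nn.Module):
    """Hypothetical multi-reward mixer: a user embedding, inferred few-shot
    from that user's example preferences, weights several base reward models.
    This is a reading of the abstract, not the paper's actual architecture."""

    def __init__(self, n_rewards: int, user_dim: int):
        super().__init__()
        self.mixer = nn.Linear(user_dim, n_rewards)

    def forward(self, base_rewards: torch.Tensor, user_emb: torch.Tensor) -> torch.Tensor:
        # base_rewards: (B, n_rewards) scores from fixed reward models
        # user_emb:     (B, user_dim) few-shot summary of one user's preferences
        weights = torch.softmax(self.mixer(user_emb), dim=-1)
        return (weights * base_rewards).sum(dim=-1)  # (B,) personalized reward
```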
- Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs.
Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware.
We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free, and Preference-Aligned prompts tailored to a specific video diffusion model.
arXiv Detail & Related papers (2024-12-19T18:32:21Z)
- VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a self-training pipeline that enables Video-LLMs to reason over video content without external supervision.
VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements.
Our model-agnostic approach is computationally efficient, requiring only 32 frames.
arXiv Detail & Related papers (2024-12-01T00:33:05Z)
- Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization [14.50339880957898]
We aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques.
For preference data generation, we propose an iterative pairwise ranking mechanism that derives a preference ranking of completions from pairwise comparison signals (a sketch follows this entry).
For training regularization, we observe that preference optimization tends to achieve better convergence when the LLM-predicted likelihood of preferred samples is slightly reduced.
arXiv Detail & Related papers (2024-11-07T23:03:11Z)
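One simple way to turn pairwise comparison signals into a ranking is insertion with a pairwise judge, sketched below; `prefers` is a hypothetical comparator (e.g., an LLM asked to pick the better completion), and the paper's exact iterative mechanism may differ:

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def rank_by_pairwise_judge(items: List[T],
                           prefers: Callable[[T, T], bool]) -> List[T]:
    """Order completions best-first using only pairwise comparisons.

    `prefers(a, b)` returns True if `a` is judged better than `b`. Each new
    item is walked down the current ranking until it beats an incumbent,
    costing O(n^2) comparisons in the worst case.
    """
    ranking: List[T] = []
    for item in items:
        pos = 0
        # Skip past incumbents that are still preferred over the new item.
        while pos < len(ranking) and prefers(ranking[pos], item):
            pos += 1
        ranking.insert(pos, item)
    return ranking
```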
- Scalable Ranked Preference Optimization for Text-to-Image Generation [76.16285931871948]
We investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training.
The preferences for paired images are generated using a pre-trained reward function, eliminating the need to involve humans in the annotation process.
We introduce RankDPO to enhance DPO-based methods with this ranking feedback (a sketch follows this entry).
arXiv Detail & Related papers (2024-10-23T16:42:56Z)
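The sketch below shows one way ranking feedback can enter a DPO-style loss: every ordered pair from a reward-ranked list contributes a logistic term, weighted by a DCG-style rank gap. The weighting and the `logratios` interface are illustrative assumptions, not RankDPO's exact formulation (for a diffusion model, the log-ratios would themselves be approximated from denoising errors):

```python
import torch
import torch.nn.functional as F

def ranked_dpo_style_loss(logratios: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Listwise DPO-style loss over K >= 2 generations ranked best-first.

    `logratios[k]` approximates log pi_theta(x_k) - log pi_ref(x_k) for the
    k-th sample, with samples sorted by a pretrained reward model.
    """
    K = logratios.shape[0]
    # DCG-style position discounts: earlier ranks matter more.
    discount = 1.0 / torch.log2(torch.arange(K, dtype=logratios.dtype) + 2.0)
    loss = logratios.new_zeros(())
    n_pairs = 0
    for i in range(K):
        for j in range(i + 1, K):
            weight = discount[i] - discount[j]  # larger for wider rank gaps
            loss = loss - weight * F.logsigmoid(beta * (logratios[i] - logratios[j]))
            n_pairs += 1
    return loss / n_pairs
```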
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment.
By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
- MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning [34.259833094575285]
MAMA is a new approach to learning video-language representations that uses a contrastive objective with a subtractive angular margin (a sketch follows this entry).
MAMA improves video-language representations and achieves superior performance on commonly used video question answering and text-video retrieval datasets.
arXiv Detail & Related papers (2024-07-04T09:52:17Z)
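A generic angular-margin contrastive objective is sketched below: the positive video-text pair's angle is penalized by a margin before the usual symmetric InfoNCE loss. Exactly how MAMA applies its subtractive margin (and how it meta-optimizes the margin) may differ from this illustrative form:

```python
import torch
import torch.nn.functional as F

def angular_margin_contrastive(video_emb, text_emb, margin=0.1, tau=0.07):
    """Symmetric InfoNCE with an angular margin on the positive pairs.

    Rows of `video_emb` and `text_emb` (B, D) are matched pairs. Enlarging
    the positive pair's angle makes positives harder to satisfy, tightening
    video-text alignment; margin and temperature values are guesses.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                                  # (B, B) cosine similarities
    # Clamp before acos so gradients stay finite at +/-1.
    theta = torch.acos(sim.clamp(-1 + 1e-6, 1 - 1e-6))
    idx = torch.arange(v.shape[0], device=v.device)
    logits = sim / tau
    # Replace each diagonal (positive) logit with its margin-penalized value.
    logits[idx, idx] = torch.cos(theta[idx, idx] + margin) / tau
    # Video-to-text and text-to-video cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, idx) + F.cross_entropy(logits.T, idx))
```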
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order (illustrated after this entry).
Our model sets a new zero-shot state of the art on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
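The three misalignment types are easy to picture with a toy generator like the one below; VideoCon actually produces its contrast captions with an LLM, so the hand-written swap tables here are purely illustrative:

```python
# Toy swap tables; VideoCon generates contrast captions with an LLM,
# not hand-written rules like these.
ENTITY_SWAPS = {"dog": "cat", "man": "woman", "guitar": "violin"}
ACTION_SWAPS = {"opens": "closes", "enters": "exits", "sits": "stands"}

def make_contrast_captions(caption: str) -> list[str]:
    """Return misaligned variants of `caption`: one entity swap, one action
    swap, and a flipped event order when the caption reads 'A, then B'."""
    variants = []
    for swaps in (ENTITY_SWAPS, ACTION_SWAPS):
        for src, dst in swaps.items():
            if src in caption:
                variants.append(caption.replace(src, dst, 1))
                break  # one perturbation of each kind is enough
    if ", then " in caption:
        first, second = caption.split(", then ", 1)
        variants.append(f"{second}, then {first}")
    return variants

print(make_contrast_captions("a man opens the door, then sits on the chair"))
# ['a woman opens the door, then sits on the chair',
#  'a man closes the door, then sits on the chair',
#  'sits on the chair, then a man opens the door']
```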