VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
- URL: http://arxiv.org/abs/2503.20491v1
- Date: Wed, 26 Mar 2025 12:28:20 GMT
- Title: VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
- Authors: Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang,
- Abstract summary: Video generation models are typically trained on text-to-video pairs with highly detailed and carefully crafted descriptions.<n>We introduce VPO, a principled framework that optimize prompts based on three core principles: harmlessness, accuracy, and helpfulness.<n>Our experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods.
- Score: 80.86205966195593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.
Related papers
- TEMPO: Temporal Preference Optimization of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment [48.94844127553743]
TEMPO is a systematic framework that enhances temporal reasoning capabilities of Video Large Language Models.<n>Our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data.<n>Our findings highlight our TEMPO as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
arXiv Detail & Related papers (2025-03-21T08:00:29Z) - HuViDPO:Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment [13.320911720001277]
We introduce the strategy of Direct Preference Optimization (DPO) into text-to-video (T2V) tasks.<n>Existing T2V generation methods lack a well-formed pipeline with exact loss function to guide the alignment of generated videos with human preferences.
arXiv Detail & Related papers (2025-02-02T16:55:42Z) - OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization [30.6130504613716]
We introduce OnlineVPO, a preference learning approach tailored specifically for video diffusion models.<n>By employing the video reward model to offer concise video feedback on the fly, OnlineVPO offers effective and efficient preference guidance.
arXiv Detail & Related papers (2024-12-19T18:34:50Z) - Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs.<n>Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware.<n>We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion model.
arXiv Detail & Related papers (2024-12-19T18:32:21Z) - VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a novel self-training pipeline for vision-language models (VLMs)<n>It generates its own training data without extensive manual annotation.<n>VideoSAVi shows significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2024-12-01T00:33:05Z) - Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [118.65089648651308]
This paper introduces a novel framework that utilizes detailed video captions as a proxy of video content.
We show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video Question Answering (QA) tasks.
arXiv Detail & Related papers (2024-04-01T17:28:16Z) - E2HQV: High-Quality Video Generation from Event Camera via
Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras or dynamic vision sensors are capable of capturing per-pixel brightness changes (called event-streams) in high temporal resolution and high dynamic range.
It calls for events-to-video (E2V) solutions which take event-streams as input and generate high quality video frames for intuitive visualization.
We propose textbfE2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and
Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.