LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
- URL: http://arxiv.org/abs/2412.04814v2
- Date: Tue, 24 Dec 2024 11:57:46 GMT
- Title: LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
- Authors: Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li
- Abstract summary: This paper proposes LiFT, a novel fine-tuning method leveraging human feedback for T2V model alignment.
We train a reward model, LiFT-Critic, to learn the reward function effectively, which serves as a proxy for human judgment.
Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood.
- Score: 15.11363628734519
- Abstract: Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model, LiFT-Critic, to learn the reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.
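The reward-weighted likelihood objective can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's code: `model.per_sample_nll` and the `reward_model` callable (standing in for LiFT-Critic) are hypothetical interfaces, and the softmax weight normalization is one common choice rather than the paper's stated scheme.

```python
import torch

def reward_weighted_loss(model, reward_model, prompts, videos):
    """Reward-weighted likelihood fine-tuning (illustrative sketch).

    Each sample's negative log-likelihood is scaled by the critic's
    reward, so well-aligned videos contribute more to the update.
    All interfaces here are hypothetical stand-ins, not LiFT's code.
    """
    with torch.no_grad():
        # A LiFT-Critic-like scorer: higher reward = better alignment.
        rewards = reward_model(prompts, videos)              # shape (B,)
        # Turn rewards into non-negative weights with mean ~1
        # (softmax normalization is an assumption, not the paper's choice).
        weights = torch.softmax(rewards, dim=0) * rewards.numel()

    # Per-sample NLL; for a diffusion T2V model this would be the
    # per-sample denoising loss (hypothetical method name).
    nll = model.per_sample_nll(videos, prompts)              # shape (B,)
    return (weights * nll).mean()
```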
Related papers
- HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment [13.320911720001277]
We introduce the strategy of Direct Preference Optimization (DPO) into text-to-video (T2V) tasks.
Existing T2V generation methods lack a well-formed pipeline with an exact loss function to guide the alignment of generated videos with human preferences.
arXiv Detail & Related papers (2025-02-02T16:55:42Z)
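For reference, the standard DPO loss on a preferred/rejected pair looks as follows. How HuViDPO computes video log-likelihoods and adapts the loss to T2V is not detailed in the summary above, so this is only the generic form.

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss (Rafailov et al., 2023), not HuViDPO's exact variant.

    logp_w / logp_l: log-likelihoods of the preferred / rejected sample
    under the model being tuned; ref_logp_*: the same quantities under a
    frozen reference model. beta controls deviation from the reference.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```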
- Improving Video Generation with Human Feedback [81.48120703718774]
Video generation has achieved significant advances, but issues like unsmooth motion and misalignment between videos and prompts persist.
We develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model.
We introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy.
arXiv Detail & Related papers (2025-01-23T18:55:41Z)
- CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training [35.43906754134253]
We propose CustomTTT, which can jointly customize the appearance and motion of a given video.
Since each LoRA is trained individually, we propose a novel test-time training technique to update parameters after combination.
Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2024-12-20T08:05:13Z)
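The idea of combining individually trained LoRAs can be sketched as merging their low-rank deltas into the base weights. CustomTTT's test-time training step, which further updates the combined parameters on the reference video, is omitted, and the interface below is a hypothetical generic form.

```python
import torch

def combine_loras(base_weight, loras, scales):
    """Merge independently trained LoRA deltas into one base weight matrix.

    loras: list of (A, B) pairs with A of shape (r, in) and B of shape
    (out, r); each contributes a low-rank update scale * (B @ A).
    This is a generic sketch, not CustomTTT's actual merging rule.
    """
    w = base_weight.clone()
    for (A, B), scale in zip(loras, scales):
        w = w + scale * (B @ A)  # add the scaled low-rank update
    return w
```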
- Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models.
We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
arXiv Detail & Related papers (2024-12-03T17:44:23Z)
- OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset.
Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos.
Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
- Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models [56.289828238673124]
Free$^2$Guide is a gradient-free framework for aligning generated videos with text prompts.
We show that Free$^2$Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.
arXiv Detail & Related papers (2024-11-26T02:14:47Z)
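The title points to path-integral-style control; one simple gradient-free instance of that idea is reward-weighted resampling of candidates, sketched below. This captures only the core concept under stated assumptions, not Free$^2$Guide's actual procedure, which operates on the diffusion sampling path.

```python
import torch

def gradient_free_select(sample_fn, score_fn, prompt, n=8, temperature=0.1):
    """Gradient-free reward guidance via softmax-weighted resampling.

    sample_fn(prompt) draws one candidate video; score_fn(prompt, video)
    is a non-differentiable scorer (e.g., a large vision-language model).
    Candidates are resampled with probability ~ exp(score / temperature),
    so no gradients flow through the scorer. Interfaces are hypothetical.
    """
    videos = [sample_fn(prompt) for _ in range(n)]
    scores = torch.tensor([score_fn(prompt, v) for v in videos])
    probs = torch.softmax(scores / temperature, dim=0)
    return videos[torch.multinomial(probs, 1).item()]
```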
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
arXiv Detail & Related papers (2023-02-23T17:34:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.