VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
- URL: http://arxiv.org/abs/2412.21059v2
- Date: Sun, 23 Mar 2025 09:37:33 GMT
- Title: VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
- Authors: Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong
- Abstract summary: We present VisionReward, a framework for learning human visual preferences in both image and video generation. VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation.
- Score: 70.68566282567207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Though reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring without interpretability and the unexpected biases that can result. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences, and leverage linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensional consistency strategy for using VisionReward as a reward model during preference optimization for visual generation. Experiments show that VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation. Notably, VisionReward surpasses VideoScore by 17.2% in preference prediction accuracy, and text-to-video models with VisionReward achieve a 31.6% higher pairwise win rate compared to the same models using VideoScore. All code and datasets are provided at https://github.com/THUDM/VisionReward.
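The checklist-plus-linear-weighting idea described in the abstract lends itself to a minimal sketch. The questions, weights, and class below are hypothetical illustrations, not VisionReward's actual checklist or learned weights:

```python
# A minimal sketch of interpretable, linearly weighted preference scoring,
# assuming a checklist of binary judgments; the questions and weights here
# are hypothetical, not VisionReward's actual checklist or learned weights.
from dataclasses import dataclass

@dataclass
class Judgment:
    dimension: str  # e.g. "composition", "fidelity", "dynamics"
    question: str   # a fine-grained yes/no check
    answer: bool    # produced by a vision-language judge

def linear_reward(judgments: list[Judgment], weights: dict[str, float]) -> float:
    """Aggregate binary checks into a scalar reward by linear weighting.

    Because the score is a weighted sum of human-readable checks, the
    contribution of each check to the final reward can be read off directly.
    """
    return sum(weights[j.question] * float(j.answer) for j in judgments)

checks = [
    Judgment("composition", "Is the subject clearly framed?", True),
    Judgment("fidelity", "Are there visible artifacts?", False),
]
weights = {
    "Is the subject clearly framed?": 0.8,
    "Are there visible artifacts?": -1.2,
}
print(linear_reward(checks, weights))  # 0.8
```

The interpretability claim follows from exactly this structure: a low score can be traced back to the individual checks that drove it, rather than emerging from a black box.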
Related papers
- Aligning Anime Video Generation with Human Feedback [31.701968335565393]
Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. We propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment.
arXiv Detail & Related papers (2025-04-14T09:49:34Z)
- Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards [45.84931291646799]
Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set and 65.4% on HPSv2.
Our findings can serve as a strong milestone for further enhancing text-to-vision models.
arXiv Detail & Related papers (2025-03-25T15:30:21Z)
- Unified Reward Model for Multimodal Understanding and Generation [32.22714522329413]
This paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment.
We first develop UnifiedReward on our constructed large-scale human preference dataset, including both image and video generation/understanding tasks.
arXiv Detail & Related papers (2025-03-07T08:36:05Z)
- Can Generative Video Models Help Pose Estimation? [42.10672365565019]
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision.
Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose.
We use a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition.
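A minimal sketch of the downstream step this enables: once intermediate frames exist, relative poses between consecutive (now-overlapping) frames can be chained into the full two-view pose. The helper name and 4x4-matrix convention are assumptions for illustration, not the paper's exact pipeline:

```python
# A minimal sketch of chaining relative poses across generated intermediate
# frames; estimate_relative_pose is a placeholder for any two-view pose
# estimator, and poses are 4x4 transforms mapping frame i to frame i+1.
import numpy as np

def compose_pose_chain(frames, estimate_relative_pose) -> np.ndarray:
    """Compose per-step transforms T_{i->i+1} into the full T_{0->N}."""
    T = np.eye(4)
    for a, b in zip(frames[:-1], frames[1:]):
        # x_{i+1} = T_{i->i+1} @ x_i, so successive motions left-multiply.
        T = estimate_relative_pose(a, b) @ T
    return T
```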
arXiv Detail & Related papers (2024-12-20T18:58:24Z)
- VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
VBench++ supports evaluating both text-to-video and image-to-video generation.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best-view pseudo-labels. During inference, our model takes as input only a multi-view video (no language or camera poses) and returns the best viewpoint to watch at each timestep.
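The pseudo-labeling rule reduces to an argmin over per-view captioning losses; a minimal sketch, with caption_loss standing in for a view-conditioned captioning model (an assumption for illustration):

```python
# A minimal sketch of the pseudo-labeling rule: the view whose features best
# predict the view-agnostic summary (lowest caption loss) becomes the
# "best view" pseudo-label. caption_loss is a placeholder for the loss of a
# view-conditioned captioning model.
def best_view_pseudo_label(views, summary, caption_loss) -> int:
    """Return the index of the view that predicts the summary most accurately."""
    losses = [caption_loss(view, summary) for view in views]
    return min(range(len(views)), key=losses.__getitem__)
```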
arXiv Detail & Related papers (2024-11-13T16:31:08Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- NARAIM: Native Aspect Ratio Autoregressive Image Models [26.26674614731835]
We propose NARAIM, a vision model pre-trained with an autoregressive objective that uses images in their native aspect ratio.
By maintaining the native aspect ratio, we preserve the original spatial context, thereby enhancing the model's ability to interpret visual information.
arXiv Detail & Related papers (2024-10-13T21:13:48Z)
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes vision models to better align them with human aesthetics.
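Preference-based fine-tuning of this kind typically optimizes a pairwise objective; the Bradley-Terry-style sketch below shows a common choice, not necessarily this paper's exact loss:

```python
# A generic Bradley-Terry-style pairwise preference loss, a common objective
# for preference-based fine-tuning; a sketch, not necessarily this paper's
# exact formulation. Scores come from the vision model being aligned.
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """-log sigmoid(s_w - s_l): pushes preferred items above rejected ones."""
    margin = score_preferred - score_rejected
    # log1p(exp(-margin)) == -log(sigmoid(margin)), stable for margin >= 0.
    return math.log1p(math.exp(-margin))
```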
arXiv Detail & Related papers (2024-06-13T17:59:20Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- Revisiting Feature Prediction for Learning Visual Representations from Video [62.08833572467379]
V-JEPA is a collection of vision models trained solely using a feature prediction objective.
The models are trained on 2 million videos collected from public datasets.
Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks.
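A minimal sketch of a feature-prediction objective of this kind: a predictor regresses the features of masked video regions from visible context, and the loss is taken in feature space rather than pixel space (the L1 distance and numpy stand-ins are illustrative simplifications, not V-JEPA's exact training code):

```python
# A minimal numpy sketch of a feature-prediction objective: the loss compares
# predicted features of masked patches against features from a (frozen)
# target encoder, with no pixel-level reconstruction. The L1 distance and
# array stand-ins are illustrative; the real networks are transformers.
import numpy as np

def feature_prediction_loss(predicted: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between predicted and frozen target features."""
    # target is treated as a constant (stop-gradient) during training.
    return float(np.mean(np.abs(predicted - target)))
```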
arXiv Detail & Related papers (2024-02-15T18:59:11Z)
- ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling [35.098725056881655]
Large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.
The generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements.
We introduce a novel framework, ViGoR, that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines.
arXiv Detail & Related papers (2024-02-09T01:00:14Z)
- VBench: Comprehensive Benchmark Suite for Video Generative Models [100.43756570261384]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
arXiv Detail & Related papers (2023-11-29T18:39:01Z)
- Retargeting video with an end-to-end framework [14.270721529264929]
We present RETVI, an end-to-end method for retargeting videos to arbitrary aspect ratios.
Our system outperforms previous work in quality and running time.
arXiv Detail & Related papers (2023-11-08T04:56:41Z)
- NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos [51.409547544747284]
NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
arXiv Detail & Related papers (2023-08-23T14:25:22Z)
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)