Learning Multi-dimensional Human Preference for Text-to-Image Generation
- URL: http://arxiv.org/abs/2405.14705v1
- Date: Thu, 23 May 2024 15:39:43 GMT
- Title: Learning Multi-dimensional Human Preference for Text-to-Image Generation
- Authors: Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang,
- Abstract summary: We propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models.
The MPS introduces the preference condition module upon CLIP model to learn these diverse preferences.
It is trained based on our Multi-dimensional Human Preference (MHP) dataset, which comprises 918,315 human preference choices across four dimensions.
- Score: 18.10755131392223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current metrics for text-to-image models typically rely on statistical metrics which inadequately represent the real preference of humans. Although recent work attempts to learn these preferences via human annotated images, they reduce the rich tapestry of human preference to a single overall score. However, the preference results vary when humans evaluate images with different aspects. Therefore, to learn the multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces the preference condition module upon CLIP model to learn these diverse preferences. It is trained based on our Multi-dimensional Human Preference (MHP) Dataset, which comprises 918,315 human preference choices across four dimensions (i.e., aesthetics, semantic alignment, detail quality and overall assessment) on 607,541 images. The images are generated by a wide range of latest text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, enabling it a promising metric for evaluating and improving text-to-image generation.
Related papers
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z) - MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models [85.30735602813093]
Multi-Image Augmented Direct Preference Optimization (MIA-DPO) is a visual preference alignment approach that effectively handles multi-image inputs.
MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats.
arXiv Detail & Related papers (2024-10-23T07:56:48Z) - PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences [6.398937923320069]
We propose PAL, a framework to model human preference complementary to existing pretraining strategies.
We show that PAL achieves competitive reward model accuracy compared to strong baselines.
arXiv Detail & Related papers (2024-06-12T17:54:54Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model VP-Score over VisionPrefer to guide the training of text-to-image generative models and the preference prediction accuracy of VP-Score is comparable to human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - Stellar: Systematic Evaluation of Human-Centric Personalized
Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals that is an order of magnitude larger than existing relevant datasets and where rich semantic ground-truth annotations are readily available.
We derive a simple yet efficient, personalized text-to-image baseline that does not require test-time fine-tuning for each subject and which sets quantitatively and in human trials a new SoTA.
arXiv Detail & Related papers (2023-12-11T04:47:39Z) - Human Preference Score v2: A Solid Benchmark for Evaluating Human
Preferences of Text-to-Image Synthesis [38.70605308204128]
Recent text-to-image generative models can generate high-fidelity images from text inputs.
HPD v2 captures human preferences on images from a wide range of sources.
HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images.
arXiv Detail & Related papers (2023-06-15T17:59:31Z) - ImageReward: Learning and Evaluating Human Preferences for Text-to-Image
Generation [30.977582244445742]
We build ImageReward, the first general-purpose text-to-image human preference reward model.
Its training is based on our systematic annotation pipeline including rating and ranking.
In human evaluation, ImageReward outperforms existing scoring models and metrics.
arXiv Detail & Related papers (2023-04-12T16:58:13Z) - Human Preference Score: Better Aligning Text-to-Image Models with Human
Preference [41.270068272447055]
We collect a dataset of human choices on generated images from the Stable Foundation Discord channel.
Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices.
We propose a simple yet effective method to adapt Stable Diffusion to better align with human preferences.
arXiv Detail & Related papers (2023-03-25T10:09:03Z) - Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
arXiv Detail & Related papers (2023-02-23T17:34:53Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including the few-shot, semi-supervised and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.