Human Preference Score v2: A Solid Benchmark for Evaluating Human
  Preferences of Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2306.09341v2
- Date: Mon, 25 Sep 2023 08:19:23 GMT
- Title: Human Preference Score v2: A Solid Benchmark for Evaluating Human
  Preferences of Text-to-Image Synthesis
- Authors: Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao,
  Hongsheng Li
- Abstract summary: Recent text-to-image generative models can generate high-fidelity images from text inputs.
HPD v2 captures human preferences on images from a wide range of sources.
HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images.
- Score: 38.70605308204128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Recent text-to-image generative models can generate high-fidelity images from
text inputs, but the quality of these generated images cannot be accurately
evaluated by existing evaluation metrics. To address this issue, we introduce
Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human
preferences on images from a wide range of sources. HPD v2 comprises 798,090
human preference choices on 433,760 pairs of images, making it the largest
dataset of its kind. The text prompts and images are deliberately collected to
eliminate potential bias, which is a common issue in previous datasets. By
fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a
scoring model that can more accurately predict human preferences on generated
images. Our experiments demonstrate that HPS v2 generalizes better than
previous metrics across various image distributions and is responsive to
algorithmic improvements of text-to-image generative models, making it a
preferable evaluation metric for these models. We also investigate the design
of the evaluation prompts for text-to-image generative models, to make the
evaluation stable, fair and easy-to-use. Finally, we establish a benchmark for
text-to-image generative models using HPS v2, which includes a set of recent
text-to-image models from academia, the community and industry. The code and
dataset are available at https://github.com/tgxs002/HPSv2 .
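
The abstract describes HPS v2 as a CLIP model fine-tuned to predict which image of a pair humans prefer, but it does not spell out the scoring or training objective. The following is a minimal sketch of one common way to set this up, not the authors' implementation: it assumes a text embedding and two image embeddings are already available (in practice they would come from a CLIP text and image encoder), scores each image by cosine similarity to the prompt, turns the two scores into a preference probability with a softmax, and measures agreement with human choices. The function names, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def preference_probability(text_emb, img_a_emb, img_b_emb, temperature=0.1):
    """Probability that image A is preferred over image B for the prompt.

    Assumption: a two-way softmax over temperature-scaled similarity scores.
    """
    s_a = cosine_similarity(text_emb, img_a_emb) / temperature
    s_b = cosine_similarity(text_emb, img_b_emb) / temperature
    m = max(s_a, s_b)  # subtract the max for numerical stability
    e_a = math.exp(s_a - m)
    e_b = math.exp(s_b - m)
    return e_a / (e_a + e_b)


def pairwise_loss(text_emb, img_a_emb, img_b_emb, human_prefers_a, temperature=0.1):
    """Negative log-likelihood of the human choice (hypothetical training objective)."""
    p_a = preference_probability(text_emb, img_a_emb, img_b_emb, temperature)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(max(p_chosen, 1e-12))  # clamp to avoid log(0)


def pairwise_accuracy(examples, temperature=0.1):
    """Fraction of pairs where the predicted preference matches the human choice.

    Each example is (text_emb, img_a_emb, img_b_emb, human_prefers_a).
    """
    correct = 0
    for text_emb, img_a_emb, img_b_emb, human_prefers_a in examples:
        p_a = preference_probability(text_emb, img_a_emb, img_b_emb, temperature)
        if (p_a > 0.5) == human_prefers_a:
            correct += 1
    return correct / len(examples)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs.
    text = [1.0, 0.0]
    image_a = [0.9, 0.1]  # close to the prompt
    image_b = [0.1, 0.9]  # far from the prompt
    print(preference_probability(text, image_a, image_b))
    print(pairwise_loss(text, image_a, image_b, human_prefers_a=True))
    print(pairwise_accuracy([(text, image_a, image_b, True)]))
```

Fine-tuning under this sketch would lower `pairwise_loss` over the annotated pairs, and `pairwise_accuracy` is the kind of agreement measure that lets a scoring model be compared against other metrics on held-out human choices.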
 
      
Related papers
- HPSv3: Towards Wide-Spectrum Human Preference Score [35.108959799842694]
 We release the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons.
We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking.
Besides, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data.
 arXiv  Detail & Related papers  (2025-08-05T17:17:13Z)
- EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation [29.176750442205325]
 In this study, we contribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks.
We introduce two new methods to evaluate the image-text alignment capabilities of T2I models.
 arXiv  Detail & Related papers  (2024-12-24T04:08:25Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
 We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
 arXiv  Detail & Related papers  (2024-11-14T13:52:43Z)
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
 We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
 arXiv  Detail & Related papers  (2024-11-08T17:07:01Z)
- Scalable Ranked Preference Optimization for Text-to-Image Generation [76.16285931871948]
 We investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training.
The preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process.
We introduce RankDPO to enhance DPO-based methods using the ranking feedback.
 arXiv  Detail & Related papers  (2024-10-23T16:42:56Z)
- Learning Multi-dimensional Human Preference for Text-to-Image Generation [18.10755131392223]
 We propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models.
The MPS introduces the preference condition module upon CLIP model to learn these diverse preferences.
It is trained based on our Multi-dimensional Human Preference (MHP) dataset, which comprises 918,315 human preference choices across four dimensions.
 arXiv  Detail & Related papers  (2024-05-23T15:39:43Z)
- Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models [85.96013373385057]
 Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent.
However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models.
We propose TextNorm, a method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts.
 arXiv  Detail & Related papers  (2024-04-02T11:40:38Z)
- Human Preference Score: Better Aligning Text-to-Image Models with Human
  Preference [41.270068272447055]
 We collect a dataset of human choices on generated images from the Stable Foundation Discord channel.
Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices.
We propose a simple yet effective method to adapt Stable Diffusion to better align with human preferences.
 arXiv  Detail & Related papers  (2023-03-25T10:09:03Z)
- Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
 Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
 arXiv  Detail & Related papers  (2023-02-23T17:34:53Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language
  Understanding [53.170767750244366]
 Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
 arXiv  Detail & Related papers  (2022-05-23T17:42:53Z)