ImagenHub: Standardizing the evaluation of conditional image generation
models
- URL: http://arxiv.org/abs/2310.01596v4
- Date: Sun, 10 Mar 2024 21:41:51 GMT
- Title: ImagenHub: Standardizing the evaluation of conditional image generation
models
- Authors: Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang,
Wenhu Chen
- Abstract summary: This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all conditional image generation models.
We design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images.
Our human evaluation achieves a high inter-worker agreement of Krippendorff's alpha on 76% models with a value higher than 0.4.
- Score: 48.51117156168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, a myriad of conditional image generation and editing models have
been developed to serve different downstream tasks, including text-to-image
generation, text-guided image editing, subject-driven image generation,
control-guided image generation, etc. However, we observe huge inconsistencies
in experimental conditions: datasets, inference, and evaluation metrics -
render fair comparisons difficult. This paper proposes ImagenHub, which is a
one-stop library to standardize the inference and evaluation of all the
conditional image generation models. Firstly, we define seven prominent tasks
and curate high-quality evaluation datasets for them. Secondly, we built a
unified inference pipeline to ensure fair comparison. Thirdly, we design two
human evaluation scores, i.e. Semantic Consistency and Perceptual Quality,
along with comprehensive guidelines to evaluate generated images. We train
expert raters to evaluate the model outputs based on the proposed metrics. Our
human evaluation achieves a high inter-worker agreement of Krippendorff's alpha
on 76% models with a value higher than 0.4. We comprehensively evaluated a
total of around 30 models and observed three key takeaways: (1) the existing
models' performance is generally unsatisfying except for Text-guided Image
Generation and Subject-driven Image Generation, with 74% models achieving an
overall score lower than 0.5. (2) we examined the claims from published papers
and found 83% of them hold with a few exceptions. (3) None of the existing
automatic metrics has a Spearman's correlation higher than 0.2 except
subject-driven image generation. Moving forward, we will continue our efforts
to evaluate newly published models and update our leaderboard to keep track of
the progress in conditional image generation.
Related papers
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated captions reference, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z) - GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation [103.3465421081531]
VQAScore is a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt.
Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward.
We release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt.
arXiv Detail & Related papers (2024-06-19T18:00:07Z) - Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that apply original next-token prediction'' paradigm of large language models to visual generation domain.
It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z) - Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis [21.619269792415903]
We present an empirical study introducing a nuanced evaluation framework for text-to-image (T2I) generative models.
Our framework categorizes evaluations into two distinct groups: first, focusing on image qualities such as aesthetics and realism, and second, examining text conditions through concept coverage and fairness.
arXiv Detail & Related papers (2024-03-08T07:41:47Z) - GenEval: An Object-Focused Framework for Evaluating Text-to-Image
Alignment [26.785655363790312]
We introduce GenEval, an object-focused framework to evaluate compositional image properties.
We show that current object detection models can be leveraged to evaluate text-to-image models.
We then evaluate several open-source text-to-image models and analyze their relative generative capabilities.
arXiv Detail & Related papers (2023-10-17T18:20:03Z) - Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual
and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images.
A higher likelihood indicates better perceptual quality and better text-image alignment.
It can successfully assess the generation ability of these models with as few as a hundred samples.
arXiv Detail & Related papers (2023-08-16T17:26:47Z) - Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.