GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
- URL: http://arxiv.org/abs/2310.11513v1
- Date: Tue, 17 Oct 2023 18:20:03 GMT
- Title: GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
- Authors: Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt
- Abstract summary: We introduce GenEval, an object-focused framework to evaluate compositional image properties.
We show that current object detection models can be leveraged to evaluate text-to-image models.
We then evaluate several open-source text-to-image models and analyze their relative generative capabilities.
- Score: 26.785655363790312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent breakthroughs in diffusion models, multimodal pretraining, and
efficient finetuning have led to an explosion of text-to-image generative
models. Given human evaluation is expensive and difficult to scale, automated
methods are critical for evaluating the increasingly large number of new
models. However, most current automated evaluation metrics like FID or
CLIPScore only offer a holistic measure of image quality or image-text
alignment, and are unsuited for fine-grained or instance-level analysis. In
this paper, we introduce GenEval, an object-focused framework to evaluate
compositional image properties such as object co-occurrence, position, count,
and color. We show that current object detection models can be leveraged to
evaluate text-to-image models on a variety of generation tasks with strong
human agreement, and that other discriminative vision models can be linked to
this pipeline to further verify properties like object color. We then evaluate
several open-source text-to-image models and analyze their relative generative
capabilities on our benchmark. We find that recent models demonstrate
significant improvement on these tasks, though they are still lacking in
complex capabilities such as spatial relations and attribute binding. Finally,
we demonstrate how GenEval might be used to help discover existing failure
modes, in order to inform development of the next generation of text-to-image
models. Our code to run the GenEval framework is publicly available at
https://github.com/djghosh13/geneval.
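To make the pipeline concrete, here is a minimal sketch of GenEval-style verification: run an off-the-shelf object detector on a generated image and check the prompted object count. torchvision's COCO-pretrained Faster R-CNN stands in for the detector; the prompt, score threshold, and helper name are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: detect objects in a generated image and verify a count prompt.
# Assumptions: torchvision's COCO detector as a stand-in, threshold of 0.5.
import torch
from PIL import Image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()
categories = weights.meta["categories"]  # COCO class names, e.g. "apple"

def check_count(image_path: str, obj: str, expected: int,
                score_thresh: float = 0.5) -> bool:
    """Return True if exactly `expected` instances of `obj` are detected."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        pred = model([preprocess(img)])[0]
    keep = pred["scores"] >= score_thresh
    labels = [categories[i] for i in pred["labels"][keep].tolist()]
    return labels.count(obj) == expected

# e.g., for the prompt "a photo of two apples":
# passed = check_count("generated.png", "apple", expected=2)
```

Color and position checks extend the same idea: crop each detected box and hand it to a discriminative model (e.g., a color classifier), or compare box coordinates for spatial relations.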
Related papers
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
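As a concrete illustration of the comparison step, the sketch below scores the similarity between a source image and the image regenerated from its caption. CLIP image embeddings are an assumed stand-in backbone here; the model name and helper are illustrative, not necessarily the paper's exact setup.

```python
# Minimal sketch of the Image2Text2Image comparison step: embed the source
# image and the image regenerated from its caption, then report cosine
# similarity. CLIP is an assumed similarity backbone, not necessarily the
# paper's choice.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_similarity(source_path: str, regenerated_path: str) -> float:
    imgs = [Image.open(p).convert("RGB")
            for p in (source_path, regenerated_path)]
    inputs = processor(images=imgs, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])  # higher => more faithful caption

# e.g., score = image_similarity("source.png", "regenerated.png")
```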
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric offers finer resolution than CLIPScore in differentiating popular image generation models.
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- ImagenHub: Standardizing the evaluation of conditional image generation models [48.51117156168]
This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all conditional image generation models.
We design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images.
Our human evaluation achieves high inter-worker agreement, with Krippendorff's alpha above 0.4 on 76% of models.
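For reference, Krippendorff's alpha can be computed with the `krippendorff` PyPI package; the ratings below are made-up illustrative data, not ImagenHub's.

```python
# Minimal sketch of the agreement statistic ImagenHub reports: Krippendorff's
# alpha over ratings from multiple annotators. The ratings are illustrative.
import numpy as np
import krippendorff

# rows = annotators, columns = rated images; np.nan marks missing ratings
ratings = np.array([
    [1, 2, 2, 0, np.nan, 1],
    [1, 2, 1, 0, 2,      1],
    [1, 2, 2, 0, 2,      np.nan],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha = {alpha:.3f}")  # above 0.4 counts as agreement here
```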
arXiv Detail & Related papers (2023-10-02T19:41:42Z)
- T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation [62.71574695256264]
T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation.
We propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation.
We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS) to boost the compositional text-to-image generation abilities.
arXiv Detail & Related papers (2023-07-12T17:59:42Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing [45.14977000707886]
Higher accuracy on ImageNet usually leads to better robustness against different corruptions.
We create a toolkit for object editing with controls of backgrounds, sizes, positions, and directions.
We evaluate the performance of current deep learning models, including both convolutional neural networks and vision transformers.
arXiv Detail & Related papers (2023-03-30T02:02:32Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z)