GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
- URL: http://arxiv.org/abs/2310.11513v1
- Date: Tue, 17 Oct 2023 18:20:03 GMT
- Title: GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
- Authors: Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt
- Abstract summary: We introduce GenEval, an object-focused framework to evaluate compositional image properties.
We show that current object detection models can be leveraged to evaluate text-to-image models.
We then evaluate several open-source text-to-image models and analyze their relative generative capabilities.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent breakthroughs in diffusion models, multimodal pretraining, and
efficient finetuning have led to an explosion of text-to-image generative
models. Given that human evaluation is expensive and difficult to scale, automated
methods are critical for evaluating the increasingly large number of new
models. However, most current automated evaluation metrics like FID or
CLIPScore only offer a holistic measure of image quality or image-text
alignment, and are unsuited for fine-grained or instance-level analysis. In
this paper, we introduce GenEval, an object-focused framework to evaluate
compositional image properties such as object co-occurrence, position, count,
and color. We show that current object detection models can be leveraged to
evaluate text-to-image models on a variety of generation tasks with strong
human agreement, and that other discriminative vision models can be linked to
this pipeline to further verify properties like object color. We then evaluate
several open-source text-to-image models and analyze their relative generative
capabilities on our benchmark. We find that recent models demonstrate
significant improvement on these tasks, though they are still lacking in
complex capabilities such as spatial relations and attribute binding. Finally,
we demonstrate how GenEval might be used to help discover existing failure
modes, in order to inform development of the next generation of text-to-image
models. Our code to run the GenEval framework is publicly available at
https://github.com/djghosh13/geneval.
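To make the pipeline concrete, here is a minimal sketch of how detector output might be checked against a structured prompt specification. The detection format, confidence threshold, and spec schema below are illustrative assumptions, not the actual GenEval implementation (see the repository above for that).

```python
# Hypothetical sketch of detector-based verification; not the GenEval code.

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def verify(detections, spec, conf_thresh=0.5):
    """detections: list of (label, confidence, (x0, y0, x1, y1)).
    spec: {"counts": {"dog": 2}, "relations": [("dog", "left of", "cat")]}."""
    boxes = {}
    for label, conf, box in detections:
        if conf >= conf_thresh:  # assumed confidence cutoff
            boxes.setdefault(label, []).append(box)

    # Co-occurrence and counting: each requested object must be detected
    # exactly the requested number of times.
    for label, n in spec.get("counts", {}).items():
        if len(boxes.get(label, [])) != n:
            return False

    # Spatial relations: compare bounding-box centers.
    for subj, rel, obj in spec.get("relations", []):
        if not boxes.get(subj) or not boxes.get(obj):
            return False
        sx, _ = center(boxes[subj][0])
        ox, _ = center(boxes[obj][0])
        if rel == "left of" and not sx < ox:
            return False
        if rel == "right of" and not sx > ox:
            return False
    return True

# Example: two dogs, one of them to the left of a cat.
dets = [("dog", 0.9, (10, 10, 50, 50)),
        ("dog", 0.8, (60, 10, 100, 50)),
        ("cat", 0.95, (120, 10, 160, 50))]
print(verify(dets, {"counts": {"dog": 2, "cat": 1},
                    "relations": [("dog", "left of", "cat")]}))  # True
```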
Related papers
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric offers greater resolution than CLIPScore in differentiating popular image generation models.
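TypeScore's exact formulation is not given here; the sketch below illustrates the general idea under the assumption that an OCR model has already read the rendered text off the generated image, scoring fidelity as normalized edit similarity with Python's standard difflib.

```python
import difflib

def text_fidelity(requested: str, ocr_extracted: str) -> float:
    """Illustrative text-fidelity score in [0, 1]: similarity between the
    text requested in the prompt and the text an OCR model reads off the
    generated image. (Not the actual TypeScore formula.)"""
    a = requested.lower().strip()
    b = ocr_extracted.lower().strip()
    return difflib.SequenceMatcher(None, a, b).ratio()

print(text_fidelity("Grand Opening", "Grand Openmg"))  # high score for a near-match
```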
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
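As a rough illustration of combining many objective metrics into a single benchmark score, the sketch below min-max normalizes each metric across models and takes a weighted average; the metric names, directions, and weights are hypothetical, not EvalCrafter's actual aggregation.

```python
# Illustrative aggregation of per-model metric values into one score.
def aggregate(scores, weights, higher_is_better):
    """scores: {metric: {model: value}} across all evaluated models."""
    totals = {}
    for metric, per_model in scores.items():
        lo, hi = min(per_model.values()), max(per_model.values())
        for model, v in per_model.items():
            norm = (v - lo) / (hi - lo) if hi > lo else 0.5
            if not higher_is_better[metric]:
                norm = 1.0 - norm  # e.g. distance-style metrics: lower is better
            totals[model] = totals.get(model, 0.0) + weights[metric] * norm
    return totals

# Hypothetical metrics for two models A and B.
scores = {"clip_sim": {"A": 0.31, "B": 0.28}, "fvd": {"A": 520.0, "B": 540.0}}
weights = {"clip_sim": 0.5, "fvd": 0.5}
print(aggregate(scores, weights, {"clip_sim": True, "fvd": False}))
# {'A': 1.0, 'B': 0.0}
```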
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- ImagenHub: Standardizing the evaluation of conditional image generation models
This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all conditional image generation models.
We design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images.
Our human evaluation achieves high inter-worker agreement, with a Krippendorff's alpha above 0.4 on 76% of the models.
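For reference, inter-worker agreement of this kind can be computed with the `krippendorff` PyPI package; the ratings below are made up for illustration.

```python
# Sketch: Krippendorff's alpha over 1-5 quality ratings (invented data).
import numpy as np
import krippendorff

ratings = np.array([  # rows = workers, columns = generated images
    [5, 4, 1, 2, 3, np.nan],   # np.nan marks images a worker did not rate
    [5, 4, 2, 2, 3, 4],
    [4, 4, 1, 2, np.nan, 4],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"alpha = {alpha:.2f}")  # values above 0.4 taken as acceptable agreement
```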
arXiv Detail & Related papers (2023-10-02T19:41:42Z)
- T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation.
We propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation.
We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost compositional text-to-image generation abilities.
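GORS itself is not specified in this summary; the sketch below shows one plausible reading of reward-driven sample selection, in which generated samples are kept for fine-tuning only if an alignment reward passes a threshold. The `reward_fn` helper and the threshold are assumptions.

```python
# Hypothetical reward-driven sample selection, inspired by the GORS idea
# as summarized above; not the paper's actual procedure.
def select_and_weight(samples, reward_fn, threshold=0.8):
    """Keep (prompt, image) pairs whose compositional-alignment reward passes
    the threshold; return them with per-sample loss weights."""
    selected = []
    for prompt, image in samples:
        r = reward_fn(prompt, image)  # e.g. a VQA- or detector-based score
        if r >= threshold:
            selected.append((prompt, image, r))  # weight fine-tuning loss by r
    return selected
```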
arXiv Detail & Related papers (2023-07-12T17:59:42Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing
Higher accuracy on ImageNet usually leads to better robustness against different corruptions.
We create a toolkit for object editing with controls of backgrounds, sizes, positions, and directions.
We evaluate the performance of current deep learning models, including both convolutional neural networks and vision transformers.
arXiv Detail & Related papers (2023-03-30T02:02:32Z)
- Counterfactual Edits for Generative Evaluation
We propose a framework for the evaluation and explanation of synthesized results based on concepts instead of pixels.
Our framework exploits knowledge-based counterfactual edits that underline which objects or attributes should be inserted, removed, or replaced from generated images.
Global explanations, produced by accumulating local edits, can also reveal which concepts a model is entirely unable to generate.
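A toy sketch of the concept-level idea: diff the objects a detector finds in a generated image against the objects the prompt requires, yielding insert/remove edit suggestions. The function and its inputs are illustrative, not the paper's method.

```python
from collections import Counter

def counterfactual_edits(required, detected):
    """Toy concept-level diff: which objects to insert or remove so the
    generated image matches the prompt. Inputs are object-label lists."""
    need, have = Counter(required), Counter(detected)
    return {
        "insert": list((need - have).elements()),  # required but missing
        "remove": list((have - need).elements()),  # generated but not requested
    }

print(counterfactual_edits(["dog", "dog", "frisbee"], ["dog", "car"]))
# {'insert': ['dog', 'frisbee'], 'remove': ['car']}
```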
arXiv Detail & Related papers (2023-03-02T20:10:18Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Re-Imagen augments a text-to-image diffusion model with retrieval: it conditions generation on image-text pairs retrieved from an external knowledge base, improving faithfulness for rare or unseen entities.
arXiv Detail & Related papers (2022-09-29T00:57:28Z)