Related papers: EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation

EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation

URL: http://arxiv.org/abs/2412.18150v2
Date: Wed, 25 Dec 2024 15:00:36 GMT
Title: EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation
Authors: Shuhao Han, Haotian Fan, Jiachen Fu, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Chunle Guo, Chongyi Li,
Abstract summary: In this study, we contribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks.<n>We introduce two new methods to evaluate the image-text alignment capabilities of T2I models.
Score: 29.176750442205325
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, Text-to-Image (T2I) generation models have achieved significant advancements. Correspondingly, many automated metrics have emerged to evaluate the image-text alignment capabilities of generative models. However, the performance comparison among these automated metrics is limited by existing small datasets. Additionally, these datasets lack the capacity to assess the performance of automated metrics at a fine-grained level. In this study, we contribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. In the construction process, we employ various strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of our benchmark. This allows us to comprehensively evaluate the effectiveness of image-text alignment metrics for T2I models. Meanwhile, we introduce two new methods to evaluate the image-text alignment capabilities of T2I models: FGA-BLIP2 which involves end-to-end fine-tuning of a vision-language model to produce fine-grained image-text alignment scores and PN-VQA which adopts a novel positive-negative VQA manner in VQA models for zero-shot fine-grained evaluation. Both methods achieve impressive performance in image-text alignment evaluations. We also use our methods to rank current AIGC models, in which the results can serve as a reference source for future study and promote the development of T2I generation. The data and code will be made publicly available.

Related papers

TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency [81.17906057429329]
LPG-Bench is a comprehensive benchmark for evaluating long-prompt-based text-to-image generation.<n>We generate 2,600 images from 13 state-of-the-art models and perform comprehensive human-ranked annotations.<n>We introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images.
arXiv Detail & Related papers (2025-10-03T13:25:16Z)
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts.<n>OneIG-Bench is a benchmark framework for evaluation of T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z)
EvalGIM: A Library for Evaluating Generative Image Models [26.631349186382664]
We introduce EvalGIM, a library for evaluating text-to-image generative models.<n>EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency.<n>EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models.
arXiv Detail & Related papers (2024-12-13T23:15:35Z)
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG evaluates responses on four levels of granularity: holistic, structural, block-level, and image-specific. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z)
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model. We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models. A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies. Our framework does not rely on human-annotated captions reference, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text. Our proposed metric demonstrates greater resolution than CLIPScore to differentiate popular image generation models.
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models [18.259733507395634]
We introduce a new metric called Visual Language Evaluation Understudy (VLEU) VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models.
arXiv Detail & Related papers (2024-09-23T04:50:36Z)
Information Theoretic Text-to-Image Alignment [49.396917351264655]
Mutual Information (MI) is used to guide model alignment. Our method uses self-supervised fine-tuning and relies on a point-wise (MI) estimation between prompts and images. Our analysis indicates that our method is superior to the state-of-the-art, yet it only requires the pre-trained denoising network of the T2I model itself to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark. FineMatch focuses on text and image mismatch detection and correction. We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation [151.57313182844936]
We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences.
arXiv Detail & Related papers (2023-10-11T17:58:33Z)
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation [62.71574695256264]
T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation. We propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation. We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS) to boost the compositional text-to-image generation abilities.
arXiv Detail & Related papers (2023-07-12T17:59:42Z)
Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task. We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.