Related papers: Visual question answering based evaluation metrics for text-to-image generation

URL: http://arxiv.org/abs/2411.10183v1
Date: Fri, 15 Nov 2024 13:32:23 GMT
Title: Visual question answering based evaluation metrics for text-to-image generation
Authors: Mizuki Miyamoto, Ryugo Morita, Jinjia Zhou
Abstract summary: This paper proposes new evaluation metrics that assess the alignment between input text and generated images for every individual object. Experimental results show that our proposed evaluation approach is the superior metric that can simultaneously assess finer text-image alignment and image quality.
Score: 7.105786967332924
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image generation and text-guided image manipulation have received considerable attention in the field of image generation. However, the mainstream evaluation methods for these tasks mainly assess the overall alignment between the input text and the generated images, and have difficulty evaluating whether all the information in the input text is accurately reflected in the generated images. This paper proposes new evaluation metrics that assess the alignment between input text and generated images for every individual object. First, ChatGPT is used to produce questions about the generated images from the input text. We then use Visual Question Answering (VQA) to measure the relevance of the generated images to the input text, which allows a more detailed evaluation of alignment than existing methods. In addition, we use Non-Reference Image Quality Assessment (NR-IQA) to evaluate not only the text-image alignment but also the quality of the generated images. Experimental results show that our proposed evaluation approach is a superior metric that can simultaneously assess finer text-image alignment and image quality, while allowing the ratio between the two to be adjusted.
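The abstract does not give the exact scoring formula, so the following is only a minimal sketch of how such a pipeline could be combined, under stated assumptions: the alignment score is taken to be the fraction of generated questions that a VQA model answers as the input text implies, the NR-IQA quality score is assumed to be already normalized to [0, 1], and the two are mixed with a hypothetical weight `alpha`. The names `QAItem`, `alignment_score`, and `combined_score` are illustrative and do not come from the paper; the VQA model is replaced by a stub callable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class QAItem:
    question: str  # question generated from the input text
    expected: str  # answer implied by the input text


def alignment_score(items: Sequence[QAItem],
                    answer_fn: Callable[[str], str]) -> float:
    """Fraction of questions whose VQA answer matches the expected answer.

    `answer_fn` stands in for a VQA model already bound to one image.
    """
    if not items:
        raise ValueError("need at least one question")
    hits = sum(
        answer_fn(item.question).strip().lower() == item.expected.strip().lower()
        for item in items
    )
    return hits / len(items)


def combined_score(alignment: float, quality: float, alpha: float = 0.5) -> float:
    """Convex combination of alignment and quality (an assumed form).

    `alpha` is the weight on alignment; both inputs are assumed to lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * alignment + (1.0 - alpha) * quality


# Usage with a stubbed VQA model (answers are made up for illustration).
stub_answers = {
    "Is there a dog in the image?": "yes",
    "Is the dog brown?": "no",
    "Is there a ball in the image?": "Yes",
}
questions = [
    QAItem("Is there a dog in the image?", "yes"),
    QAItem("Is the dog brown?", "yes"),
    QAItem("Is there a ball in the image?", "yes"),
]
align = alignment_score(questions, lambda q: stub_answers.get(q, ""))
score = combined_score(align, quality=0.8, alpha=0.7)
```

Raising `alpha` makes the combined score favor per-object alignment, and lowering it favors image quality, which is one plausible reading of the adjustable ratio mentioned in the abstract.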

Related papers

Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models [44.05134959039957]
Text-to-image models often struggle to generate images that precisely match textual prompts. Existing evaluations primarily focus on agreement with human assessments. We propose recommendations for improving image-text alignment evaluation.
arXiv Detail & Related papers (2025-06-10T06:11:36Z)
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories. We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z)
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation [27.336251972097077]
RefVNLI is a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. It outperforms or matches existing baselines across multiple benchmarks and subject categories. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
arXiv Detail & Related papers (2025-04-24T12:44:51Z)
T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation [2.273629240935727]
We propose a novel metric that breaks down images into components, and texts into fine-grained questions about the generated image for evaluation. Our method outperforms previous state-of-the-art metrics, demonstrating its effectiveness in evaluating text-to-image generative models.
arXiv Detail & Related papers (2025-03-14T15:06:12Z)
Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias [52.590072198551944]
The aim of image personalization is to create images based on a user-provided subject. Current methods face challenges in ensuring fidelity to the text prompt. We introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images.
arXiv Detail & Related papers (2025-03-09T14:14:02Z)
Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models. A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies. Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
A Novel Evaluation Framework for Image2Text Generation [15.10524860121122]
We propose an evaluation framework rooted in a modern large language model (LLM) capable of image generation. A high similarity score suggests that the image captioning model has accurately generated textual descriptions. A low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance.
arXiv Detail & Related papers (2024-08-03T09:27:57Z)
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark. FineMatch focuses on text and image mismatch detection and correction. We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
A Survey on Quality Metrics for Text-to-Image Generation [9.753473063305503]
AI-based text-to-image models not only excel at generating realistic images, but also give designers increasingly fine-grained control over the image content. These approaches have gathered increased attention within the computer graphics research community. We provide a comprehensive overview of such text-to-image quality metrics, and propose a taxonomy to categorize these metrics.
arXiv Detail & Related papers (2024-03-18T14:24:20Z)
Holistic Evaluation of Text-To-Image Models [153.47415461488097]
We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). We identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
arXiv Detail & Related papers (2023-11-07T19:00:56Z)
Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images. A higher likelihood indicates better perceptual quality and better text-image alignment. It can successfully assess the generation ability of these models with as few as a hundred samples.
arXiv Detail & Related papers (2023-08-16T17:26:47Z)
Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback [20.78162037954646]
We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment. Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
arXiv Detail & Related papers (2023-07-10T17:54:57Z)
What You See is What You Read? Improving Text-Image Alignment Evaluation [28.722369586165108]
We study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE, spanning multiple datasets from both text-to-image and image-to-text generation tasks. We describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models.
arXiv Detail & Related papers (2023-05-17T17:43:38Z)
InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation [69.1642316502563]
We propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation.
arXiv Detail & Related papers (2023-05-10T09:22:44Z)
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.).
arXiv Detail & Related papers (2023-03-21T14:41:02Z)
Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models. We show that human-generated captions are of substantially higher quality than machine-generated ones. We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.