Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering
- URL: http://arxiv.org/abs/2409.12784v7
- Date: Mon, 10 Feb 2025 13:10:32 GMT
- Title: Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering
- Authors: Youngsun Lim, Hojun Choi, Hyunjung Shim
- Abstract summary: We introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel evaluation metric.
I-HallA measures the factuality of generated images through visual question answering (VQA).
We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information.
- Score: 13.490305443938817
- Abstract: Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing TTI models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation ($\rho$=0.95) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate TTI generation models. Additional resources can be found on our project page: https://sgt-lim.github.io/I-HallA/.
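For intuition, here is a minimal, hypothetical sketch of how a VQA-based factuality score of this kind can be computed and validated against human judgments. The `vqa_answer` helper is a placeholder for any multiple-choice VQA backend (the paper uses GPT-4 Omni-based agents); this is not the authors' released code.

```python
# Minimal sketch of a VQA-based factuality score in the spirit of I-HallA.
# `vqa_answer` is a hypothetical stand-in for any multiple-choice VQA backend
# (the paper uses GPT-4 Omni-based agents); this is not the authors' code.
from dataclasses import dataclass
from scipy.stats import spearmanr

@dataclass
class QAPair:
    question: str
    choices: list[str]
    answer: str  # gold answer derived from the factual source text

def vqa_answer(image_path: str, question: str, choices: list[str]) -> str:
    """Placeholder: query a VQA model and return one of `choices`."""
    raise NotImplementedError

def factuality_score(image_path: str, qa_pairs: list[QAPair]) -> float:
    """Fraction of curated questions the generated image answers correctly."""
    correct = sum(
        vqa_answer(image_path, qa.question, qa.choices) == qa.answer
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)

def metric_human_agreement(metric_scores: list[float],
                           human_scores: list[float]) -> float:
    """Rank correlation between metric and human scores over the same images;
    the paper reports a Spearman rho of 0.95 for I-HallA."""
    rho, _ = spearmanr(metric_scores, human_scores)
    return float(rho)
```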
Related papers
- HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images [4.468589513127865]
Visual Question Answering (VQA) tasks rely on images to convey the critical information needed to answer text-based questions.
Our dataset and model will be released soon.
arXiv Detail & Related papers (2024-12-24T10:25:41Z)
- Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent [9.748808189341526]
An effective Text-to-Image (T2I) evaluation metric should detect instances where generated images do not align with their textual prompts.
We propose a method based on large language models (LLMs) that conducts question-answering over an extracted scene graph, and we create a dataset of generated images with human-rated scores; a minimal sketch of the scene-graph QA idea follows the citation below.
arXiv Detail & Related papers (2024-12-07T18:44:38Z)
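A hedged sketch of the scene-graph idea from the entry above: parse the prompt into (subject, relation, object) triples, turn each into a yes/no question, and ask a VQA model about the generated image. `prompt_to_triples` and the `vqa_yes_no` callable are hypothetical placeholders; the paper performs these steps with an LLM agent.

```python
# Hedged sketch of scene-graph-based QA evaluation; all helpers hypothetical.
from typing import Callable

def prompt_to_triples(prompt: str) -> list[tuple[str, str, str]]:
    """Placeholder for an LLM call that extracts a scene graph, e.g.
    'a red car next to a tree' -> [('car', 'is', 'red'), ('car', 'next to', 'tree')]."""
    raise NotImplementedError

def triple_to_question(triple: tuple[str, str, str]) -> str:
    subject, relation, obj = triple
    if relation == "is":  # attribute triple
        return f"Is the {subject} {obj}?"
    return f"Is the {subject} {relation} the {obj}?"

def alignment_score(image_path: str, prompt: str,
                    vqa_yes_no: Callable[[str, str], bool]) -> float:
    """Share of scene-graph facts that the VQA model confirms in the image."""
    triples = prompt_to_triples(prompt)
    confirmed = sum(vqa_yes_no(image_path, triple_to_question(t)) for t in triples)
    return confirmed / len(triples)
```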
- How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold [50.33428591760124]
We study the relationship between a concept's frequency in the training dataset and the ability of a model to imitate it.
We propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training multiple models from scratch; a toy estimator is sketched after the citation below.
arXiv Detail & Related papers (2024-10-19T06:28:14Z)
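As a toy illustration of estimating an imitation threshold without retraining, one could take per-concept (training frequency, imitation score) pairs measured on a single pretrained model and pick the frequency cut that best separates low- from high-imitation concepts. This change-point heuristic is an assumption for illustration, not the paper's estimator.

```python
# Toy change-point heuristic for an imitation threshold: maximize the gap
# between mean imitation scores above and below a frequency cut.
# Illustrative assumption, not the paper's method.
def imitation_threshold(freqs: list[int], scores: list[float]) -> int:
    pairs = sorted(zip(freqs, scores))  # sort concepts by training frequency
    best_gap, best_freq = float("-inf"), pairs[0][0]
    for i in range(1, len(pairs)):
        left = [s for _, s in pairs[:i]]    # low-frequency concepts
        right = [s for _, s in pairs[i:]]   # high-frequency concepts
        gap = sum(right) / len(right) - sum(left) / len(left)
        if gap > best_gap:
            best_gap = gap
            best_freq = pairs[i][0]  # smallest frequency on the high side
    return best_freq
```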
- Holistic Evaluation of Text-To-Image Models [153.47415461488097]
We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM).
We identify 12 aspects: text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
arXiv Detail & Related papers (2023-11-07T19:00:56Z)
- Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation [64.64849950642619]
We develop an evaluation framework inspired by formal semantics for evaluating text-to-image models.
We show that Davidsonian Scene Graph (DSG) produces atomic and unique questions organized in dependency graphs.
We also present DSG-1k, an open-source evaluation benchmark that includes 1,060 prompts; a dependency-aware scoring sketch follows the citation below.
arXiv Detail & Related papers (2023-10-27T16:20:10Z)
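A sketch of dependency-aware scoring in the DSG style: each atomic yes/no question may depend on a parent, and a child question only counts when its parent was answered affirmatively (asking about a car's color is meaningless if no car is present). The skip-invalid-children convention and data layout below are illustrative assumptions.

```python
# Sketch of dependency-aware scoring over a question dependency graph.
# Convention assumed here: children of failed parents are excluded from
# the score rather than counted as errors.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Question:
    text: str
    parent: Optional[int] = None  # index of the prerequisite question, if any

def dsg_score(questions: list[Question], answers: list[bool]) -> float:
    """Fraction of *valid* questions answered 'yes'; a question is valid if
    it has no parent or its parent was answered 'yes'."""
    valid = [i for i, q in enumerate(questions)
             if q.parent is None or answers[q.parent]]
    return sum(answers[i] for i in valid) / len(valid)

# Example: "Is there a car?" must hold before "Is the car red?" is meaningful.
qs = [Question("Is there a car?"), Question("Is the car red?", parent=0)]
print(dsg_score(qs, [True, False]))   # 0.5: both valid, one correct
print(dsg_score(qs, [False, False]))  # 0.0: child excluded, root failed
```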
- On quantifying and improving realism of images generated with diffusion [50.37578424163951]
We propose a metric, called Image Realism Score (IRS), computed from five statistical measures of a given image.
IRS can be used directly to classify a given image as real or fake.
We experimentally establish the model- and data-agnostic nature of the proposed IRS by successfully detecting fake images generated by the Stable Diffusion Model (SDM), DALL-E 2, Midjourney, and BigGAN.
Our efforts have also led to the Gen-100 dataset, which provides 1,000 samples for 100 classes generated by four high-quality models; an illustrative realism-score sketch follows the citation below.
arXiv Detail & Related papers (2023-09-26T08:32:55Z)
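To make the IRS recipe concrete, here is a hedged sketch that combines several image-level statistics into one scalar and thresholds it. The five statistics below are invented placeholders; the paper defines its own five statistical measures, which are not reproduced here.

```python
# Hedged sketch of an IRS-style realism score with placeholder statistics.
import numpy as np

def image_stats(img: np.ndarray) -> np.ndarray:
    """Toy statistics over a grayscale image with values in [0, 1]."""
    gx, gy = np.gradient(img)
    return np.array([
        img.mean(),                                       # brightness
        img.std(),                                        # global contrast
        np.abs(gx).mean() + np.abs(gy).mean(),            # edge energy
        np.percentile(img, 99) - np.percentile(img, 1),   # dynamic range
        ((img - img.mean()) ** 3).mean() / (img.std() ** 3 + 1e-8),  # skew
    ])

def is_real(img: np.ndarray, weights: np.ndarray, threshold: float) -> bool:
    """Classify an image as real when the weighted score clears a threshold,
    with weights and threshold fit on images of known provenance."""
    return float(image_stats(img) @ weights) >= threshold
```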
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA).
We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.); category-level reporting is sketched after the citation below.
arXiv Detail & Related papers (2023-03-21T14:41:02Z)
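A small sketch of the per-category reporting such a benchmark enables: overall faithfulness is VQA accuracy over all questions, and category-level accuracy localizes failure modes (e.g., strong on objects, weak on counting). Field names are illustrative.

```python
# Sketch of category-level accuracy for a TIFA-style benchmark.
from collections import defaultdict

def per_category_accuracy(records: list[dict]) -> dict[str, float]:
    """records: e.g. [{'category': 'counting', 'correct': True}, ...]"""
    hits: dict = defaultdict(int)
    totals: dict = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += bool(r["correct"])
    return {c: hits[c] / totals[c] for c in totals}
```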
- How good are deep models in understanding the generated images? [47.64219291655723]
Two sets of generated images are collected for object recognition and visual question answering tasks.
On object recognition, the best of 10 state-of-the-art models achieves top-1 and top-5 accuracy of about 60% and 80%, respectively; a top-k accuracy sketch follows the citation below.
On VQA, the OFA model scores 77.3% on answering 241 binary questions across 50 images.
arXiv Detail & Related papers (2022-08-23T06:44:43Z)
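For reference, the top-k accuracy quoted in the entry above counts a prediction as correct when the true label appears among the model's k highest-scoring classes; a minimal NumPy sketch:

```python
# Minimal NumPy sketch of top-k accuracy.
import numpy as np

def top_k_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """logits: (n_samples, n_classes); labels: (n_samples,)."""
    topk = np.argsort(logits, axis=1)[:, -k:]       # k best classes per row
    hits = (topk == labels[:, None]).any(axis=1)    # label among the k best?
    return float(hits.mean())
```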
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.