TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation
with Question Answering
- URL: http://arxiv.org/abs/2303.11897v3
- Date: Thu, 17 Aug 2023 21:45:52 GMT
- Title: TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation
with Question Answering
- Authors: Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf,
Ranjay Krishna, Noah A Smith
- Abstract summary: We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA).
We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.).
- Score: 86.38098280689027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite thousands of researchers, engineers, and artists actively working on
improving text-to-image generation models, systems often fail to produce images
that accurately align with the text inputs. We introduce TIFA (Text-to-Image
Faithfulness evaluation with question Answering), an automatic evaluation
metric that measures the faithfulness of a generated image to its text input
via visual question answering (VQA). Specifically, given a text input, we
automatically generate several question-answer pairs using a language model. We
calculate image faithfulness by checking whether existing VQA models can answer
these questions using the generated image. TIFA is a reference-free metric that
allows for fine-grained and interpretable evaluations of generated images. TIFA
also has better correlations with human judgments than existing metrics. Based
on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse
text inputs and 25K questions across 12 categories (object, counting, etc.). We
present a comprehensive evaluation of existing text-to-image models using TIFA
v1.0 and highlight the limitations and challenges of current models. For
instance, we find that current text-to-image models, despite doing well on
color and material, still struggle in counting, spatial relations, and
composing multiple objects. We hope our benchmark will help carefully measure
the research progress in text-to-image synthesis and provide valuable insights
for further research.
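
To make the scoring procedure concrete, the following is a minimal sketch of a TIFA-style faithfulness score, assuming a question-generation language model and an off-the-shelf VQA model wrapped as simple callables. The names `generate_qa_pairs` and `vqa_answer` are illustrative stand-ins, not the authors' released API.

```python
# Minimal sketch of a TIFA-style faithfulness score (illustrative, not the
# authors' released implementation). `generate_qa_pairs` stands in for a
# language model that turns a text prompt into question/answer pairs, and
# `vqa_answer` stands in for any off-the-shelf VQA model.
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, expected answer)


def tifa_score(
    text_prompt: str,
    generated_image: object,  # e.g. an image produced by a text-to-image model
    generate_qa_pairs: Callable[[str], List[QAPair]],
    vqa_answer: Callable[[object, str], str],
) -> float:
    """Fraction of LM-generated questions that the VQA model answers
    correctly on the generated image; 1.0 means fully faithful."""
    qa_pairs = generate_qa_pairs(text_prompt)
    if not qa_pairs:
        return 0.0
    correct = sum(
        1
        for question, expected in qa_pairs
        # Simple normalized string match; multiple-choice answers keep
        # this comparison well defined.
        if vqa_answer(generated_image, question).strip().lower()
        == expected.strip().lower()
    )
    return correct / len(qa_pairs)
```

Because the score is simply per-question accuracy, it can also be aggregated per question category (object, counting, spatial relations, etc.) to obtain the fine-grained, interpretable breakdown described above.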
Related papers
- Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering [13.490305443938817]
We introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel evaluation metric.
I-HallA measures the factuality of generated images through visual question answering (VQA).
We evaluate five text-to-image models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information.
arXiv Detail & Related papers (2024-09-19T13:51:21Z) - Evaluating Text-to-Visual Generation with Image-to-Text Generation [113.07368313330994]
VQAScore uses a visual-question-answering (VQA) model to produce an alignment score.
It achieves state-of-the-art results across eight image-text alignment benchmarks.
We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts.
arXiv Detail & Related papers (2024-04-01T17:58:06Z) - Zero-shot Translation of Attention Patterns in VQA Models to Natural
Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework does not require any training and allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z) - Holistic Evaluation of Text-To-Image Models [153.47415461488097]
We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM).
We identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
arXiv Detail & Related papers (2023-11-07T19:00:56Z) - What You See is What You Read? Improving Text-Image Alignment Evaluation [28.722369586165108]
We study methods for automatic text-image alignment evaluation.
We first introduce SeeTRUE, an evaluation suite spanning multiple datasets from both text-to-image and image-to-text generation tasks.
We describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models.
arXiv Detail & Related papers (2023-05-17T17:43:38Z) - Learning the Visualness of Text Using Large Vision-Language Models [42.75864384249245]
Visual text evokes an image in a person's mind, while non-visual text fails to do so.
A method to automatically detect visualness in text will enable text-to-image retrieval and generation models to augment text with relevant images.
We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators.
arXiv Detail & Related papers (2023-05-11T17:45:16Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - VisualMRC: Machine Reading Comprehension on Document Images [4.057968826847943]
Given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language.
VisualMRC focuses more on developing natural language understanding and generation abilities.
It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from webpages across multiple domains.
arXiv Detail & Related papers (2021-01-27T09:03:06Z)