What You See is What You Read? Improving Text-Image Alignment Evaluation
- URL: http://arxiv.org/abs/2305.10400v4
- Date: Tue, 26 Dec 2023 15:58:24 GMT
- Title: What You See is What You Read? Improving Text-Image Alignment Evaluation
- Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni,
Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor
- Abstract summary: We study methods for automatic text-image alignment evaluation.
We first introduce SeeTRUE, spanning multiple datasets from both text-to-image and image-to-text generation tasks.
We describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models.
- Score: 28.722369586165108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically determining whether a text and a corresponding image are
semantically aligned is a significant challenge for vision-language models,
with applications in generative text-to-image and image-to-text tasks. In this
work, we study methods for automatic text-image alignment evaluation. We first
introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets
from both text-to-image and image-to-text generation tasks, with human
judgements for whether a given text-image pair is semantically aligned. We then
describe two automatic methods to determine alignment: the first involving a
pipeline based on question generation and visual question answering models, and
the second employing an end-to-end classification approach by finetuning
multimodal pretrained models. Both methods surpass prior approaches in various
text-image alignment tasks, with significant improvements in challenging cases
that involve complex composition or unnatural images. Finally, we demonstrate
how our approaches can localize specific misalignments between an image and a
given text, and how they can be used to automatically re-rank candidates in
text-to-image generation.
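The first method described in the abstract — generate questions from the text, answer them against the image with a VQA model, and aggregate the answers into an alignment score that can also re-rank generation candidates — can be sketched roughly as follows. This is not the authors' code: the question-generation and VQA models are passed in as generic callables (in practice these would be pretrained models), and the aggregation is a simple average, a plausible but assumed choice.

```python
from typing import Callable, List, Tuple

def alignment_score(
    text: str,
    image,  # any image representation the VQA callable accepts
    generate_questions: Callable[[str], List[str]],
    vqa_yes_prob: Callable[[object, str], float],
) -> float:
    """Score text-image alignment as the average 'yes' probability
    over questions derived from the text (a sketch of a QG+VQA pipeline)."""
    questions = generate_questions(text)
    if not questions:
        return 0.0
    return sum(vqa_yes_prob(image, q) for q in questions) / len(questions)

def rerank(
    text: str,
    candidate_images: List[object],
    generate_questions: Callable[[str], List[str]],
    vqa_yes_prob: Callable[[object, str], float],
) -> List[Tuple[object, float]]:
    """Order candidate images for one prompt by alignment score, best first --
    the re-ranking use case mentioned in the abstract."""
    scored = [
        (img, alignment_score(text, img, generate_questions, vqa_yes_prob))
        for img in candidate_images
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because failing questions identify which part of the text is unsupported by the image, the same per-question scores also localize misalignments, as the abstract notes.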
Related papers
- Visual question answering based evaluation metrics for text-to-image generation [7.105786967332924]
This paper proposes new evaluation metrics that assess the alignment between input text and generated images for every individual object.
Experimental results show that the proposed approach outperforms existing metrics by simultaneously assessing finer-grained text-image alignment and image quality.
arXiv Detail & Related papers (2024-11-15T13:32:23Z)
- Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanation of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image.
We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z)
- Style Generation: Image Synthesis based on Coarsely Matched Texts [10.939482612568433]
We introduce a novel task called text-based style generation and propose a two-stage generative adversarial network.
The first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature.
The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization.
arXiv Detail & Related papers (2023-09-08T21:51:11Z)
- Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis [37.32270579534541]
We propose a novel approach for enhancing text-image correspondence by leveraging available semantic layouts.
Our approach achieves higher text-image correspondence than existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets.
arXiv Detail & Related papers (2023-08-16T05:59:33Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- TextMatcher: Cross-Attentional Neural Network to Compare Image and Text [0.0]
We devise the first machine-learning model specifically designed for this problem.
We extensively evaluate the empirical performance of TextMatcher on the popular IAM dataset.
We showcase TextMatcher in a real-world application scenario concerning the automatic processing of bank cheques.
arXiv Detail & Related papers (2022-05-11T14:01:12Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.