Revising Image-Text Retrieval via Multi-Modal Entailment
- URL: http://arxiv.org/abs/2208.10126v1
- Date: Mon, 22 Aug 2022 07:58:54 GMT
- Title: Revising Image-Text Retrieval via Multi-Modal Entailment
- Authors: Xu Yan, Chunhui Ai, Ziqiang Cao, Min Cao, Sujian Li, Wenjie Chen,
Guohong Fu
- Abstract summary: The many-to-many matching phenomenon is quite common in widely used image-text retrieval datasets.
We propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions.
- Score: 25.988058843564335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An outstanding image-text retrieval model depends on high-quality labeled
data. While the builders of existing image-text retrieval datasets strive to
ensure that the caption matches the linked image, they cannot prevent a caption
from fitting other images. We observe that such a many-to-many matching
phenomenon is quite common in the widely-used retrieval datasets, where one
caption can describe up to 178 images. These numerous unlabeled matches not only
confuse the model during training but also weaken the evaluation accuracy. Inspired
by visual and textual entailment tasks, we propose a multi-modal entailment
classifier to determine whether a sentence is entailed by an image plus its
linked captions. Subsequently, we revise the image-text retrieval datasets by
adding these entailed captions as additional weak labels of an image and
develop a universal variable learning rate strategy to teach a retrieval model
to distinguish the entailed captions from other negative samples. In
experiments, we manually annotate an entailment-corrected image-text retrieval
dataset for evaluation. The results demonstrate that the proposed entailment
classifier achieves about 78% accuracy and consistently improves the
performance of image-text retrieval baselines.
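As a rough illustration of the training idea in the abstract, the sketch below keeps entailed captions as weak positives in a batch triplet loss and down-weights their contribution, which scales their gradients like a reduced, per-sample learning rate. This is a minimal sketch under stated assumptions, not the paper's actual universal variable learning rate strategy; the label convention, the weak_weight factor, and the function name weak_label_triplet_loss are all hypothetical.

```python
# Minimal sketch (not the paper's implementation): a batch triplet loss where
# entailed captions are weak positives whose loss is down-weighted, which
# scales their gradients like a reduced, per-sample learning rate.
import torch
import torch.nn.functional as F


def weak_label_triplet_loss(img_emb, cap_emb, labels, margin=0.2, weak_weight=0.3):
    """img_emb, cap_emb: (B, D) L2-normalized embeddings of matched pairs.
    labels: (B,) long tensor, 1 = annotated positive, 2 = entailed (weak) positive.
    The label convention and all names here are hypothetical."""
    sims = img_emb @ cap_emb.t()                      # (B, B) cosine similarities
    pos = sims.diag().unsqueeze(1)                    # similarity of each true pair
    eye = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    # Hinge cost of each image against every other caption in the batch.
    cost = F.relu(margin + sims - pos).masked_fill(eye, 0.0).sum(dim=1)
    # Entailed (weakly labeled) pairs still act as positives, but with a smaller
    # weight than annotated ones, so they pull the model less strongly.
    weights = torch.where(labels == 2,
                          torch.full_like(cost, weak_weight),
                          torch.ones_like(cost))
    return (weights * cost).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    img = F.normalize(torch.randn(4, 8), dim=1)
    cap = F.normalize(torch.randn(4, 8), dim=1)
    labels = torch.tensor([1, 1, 2, 1])               # third caption is only entailed
    print(weak_label_triplet_loss(img, cap, labels).item())
```

In a fuller implementation, the weight could instead modulate the optimizer's per-sample learning rate directly; the fixed 0.3 used here is only a placeholder.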
Related papers
- Evaluating authenticity and quality of image captions via sentiment and semantic analyses [0.0]
Deep learning relies heavily on huge amounts of labelled data for tasks such as natural language processing and computer vision.
In image-to-text or image-to-image pipelines, opinion (sentiment) may be inadvertently learned by a model from human-generated image captions.
This study proposes an evaluation method focused on sentiment and semantic richness.
arXiv Detail & Related papers (2024-09-14T23:50:23Z) - TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features with no need for additional data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments substantiate an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z) - ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z) - Image Captioners Sometimes Tell More Than Images They See [8.640488282016351]
Image captioning, a.k.a. "image-to-text," generates descriptive text from given images.
We have performed experiments involving the classification of images from descriptive text alone.
We have evaluated several image captioning models with respect to a disaster image classification task, CrisisNLP.
arXiv Detail & Related papers (2023-05-04T15:32:41Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled
Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - Is An Image Worth Five Sentences? A New Look into Semantics for
Image-Text Matching [10.992151305603267]
We propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance.
We incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss.
arXiv Detail & Related papers (2021-10-06T09:54:28Z) - NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z) - Telling the What while Pointing the Where: Fine-grained Mouse Trace and
Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the content the user is looking for appears.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions when confronted with semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Using Text to Teach Image Retrieval [47.72498265721957]
We build on the concept of image manifold to represent the feature space of images, learned via neural networks, as a graph.
We augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images.
The experimental results show that the joint embedding manifold is a robust representation and provides a better basis for image retrieval.
arXiv Detail & Related papers (2020-11-19T16:09:14Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.