Removing Word-Level Spurious Alignment between Images and
Pseudo-Captions in Unsupervised Image Captioning
- URL: http://arxiv.org/abs/2104.13872v1
- Date: Wed, 28 Apr 2021 16:36:52 GMT
- Title: Removing Word-Level Spurious Alignment between Images and
Pseudo-Captions in Unsupervised Image Captioning
- Authors: Ukyo Honda, Yoshitaka Ushiku, Atsushi Hashimoto, Taro Watanabe, Yuji
Matsumoto
- Abstract summary: Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs.
We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions.
- Score: 37.14912430046118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised image captioning is a challenging task that aims at generating
captions without the supervision of image-sentence pairs, but only with images
and sentences drawn from different sources and object labels detected from the
images. In previous work, pseudo-captions, i.e., sentences that contain the
detected object labels, were assigned to a given image. The focus of the
previous work was on the alignment of input images and pseudo-captions at the
sentence level. However, pseudo-captions contain many words that are irrelevant
to a given image. In this work, we investigate the effect of removing
mismatched words from image-sentence alignment to determine how they make this
task difficult. We propose a simple gating mechanism that is trained to align
image features with only the most reliable words in pseudo-captions: the
detected object labels. The experimental results show that our proposed method
outperforms the previous methods without introducing complex sentence-level
learning objectives. Combined with the sentence-level alignment method of
previous work, our method further improves its performance. These results
confirm the importance of careful alignment in word-level details.
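To make the gating idea concrete, below is a minimal PyTorch sketch of one way such a word-level gate could be wired up: the gate is supervised to open only for detected object labels, and the (detached) gate then weights a word-level image-text alignment term. The gate network, feature shapes, and loss weighting are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGate(nn.Module):
    """Sketch: predicts, for each pseudo-caption word, how reliably it
    describes the image, and uses that weight in the alignment loss."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # scores (image, word) pairs

    def forward(self, image_feat, word_feats, is_object_label):
        # image_feat: (B, dim); word_feats: (B, T, dim)
        # is_object_label: (B, T) binary mask, 1 for detected object labels
        B, T, _ = word_feats.shape
        img = image_feat.unsqueeze(1).expand(-1, T, -1)
        g = torch.sigmoid(self.gate(torch.cat([img, word_feats], dim=-1))).squeeze(-1)

        # Supervise the gate to open only for the detected object labels,
        # the most reliable words in a pseudo-caption.
        gate_loss = F.binary_cross_entropy(g, is_object_label.float())

        # Word-level alignment: pull image features toward words,
        # weighted by the detached gate so unreliable words are ignored.
        sim = F.cosine_similarity(img, word_feats, dim=-1)  # (B, T)
        align_loss = -(g.detach() * sim).sum(dim=1).mean()
        return gate_loss + align_loss
```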
Related papers
- Learning Camouflaged Object Detection from Noisy Pseudo Label [60.9005578956798]
This paper introduces the first weakly semi-supervised Camouflaged Object Detection (COD) method.
It aims for budget-efficient and high-precision camouflaged object segmentation with an extremely limited number of fully labeled images.
We propose a noise correction loss that facilitates the model's learning of correct pixels in the early learning stage.
When using only 20% of fully labeled data, our method shows superior performance over the state-of-the-art methods.
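The summary above does not spell the loss out; purely as an illustration, one way a noise-correction term could favor correct pixels in the early learning stage is to weight the per-pixel loss by the model's own agreement with the noisy pseudo label. This sketch is an assumption, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def noise_corrected_bce(logits, pseudo_mask):
    # logits, pseudo_mask: (B, 1, H, W); pseudo_mask is a noisy 0/1 label.
    target = pseudo_mask.float()
    prob = torch.sigmoid(logits)
    # Agreement between the model's own prediction and the pseudo label;
    # pixels the model already fits are more likely to be labeled correctly.
    agreement = (1 - (prob - target).abs()).detach()
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    # Emphasize the likely-correct pixels and soften the likely-noisy ones.
    return (agreement * bce).mean()
```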
arXiv Detail & Related papers (2024-07-18T04:53:51Z)
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between the two images, which risks producing error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
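As a hedged reading of "correlating the corresponding channels", the sketch below computes a per-channel correlation between the two images' feature maps and turns it into channel weights; the paper's actual network is not reconstructed here.

```python
import torch

def channel_correlation_weights(x1, x2, eps: float = 1e-6):
    # x1, x2: (B, C, H, W) features of the two similar images.
    a = x1.flatten(2)                    # (B, C, H*W)
    b = x2.flatten(2)
    a = a - a.mean(dim=2, keepdim=True)  # center each channel
    b = b - b.mean(dim=2, keepdim=True)
    corr = (a * b).sum(dim=2) / (a.norm(dim=2) * b.norm(dim=2) + eps)  # (B, C)
    # Highly correlated channels carry the shared scene content; using the
    # correlation as a channel weight suppresses distractor-driven channels.
    return corr.clamp(min=0).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
```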
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- Towards Image Semantics and Syntax Sequence Learning [8.033697392628424]
We introduce the concept of "image grammar", consisting of "image semantics" and "image syntax".
We propose a weakly supervised two-stage approach to learn the image grammar relative to a class of visual objects/scenes.
Our framework is trained to reason over patch semantics and detect faulty syntax.
arXiv Detail & Related papers (2024-01-31T00:16:02Z)
- Object-Centric Unsupervised Image Captioning [19.59302443472258]
In the supervised setting, image-caption pairs are "well-matched", where all objects mentioned in the sentence appear in the corresponding image.
Our work overcomes the lack of such matching in the unsupervised setting by harvesting objects that correspond to a given sentence from the training set, even if they do not belong to the same image.
When used as input to a transformer, such a mixture of objects enables larger, if not full, object coverage.
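A minimal sketch of this harvesting step, assuming a training-set-wide index from detector labels to region features (the function names and data layout are hypothetical):

```python
from collections import defaultdict
import random

def build_label_index(detections):
    """detections: iterable of (label, feature) pairs collected over the
    whole training set. Index object features by their detected label."""
    index = defaultdict(list)
    for label, feature in detections:
        index[label].append(feature)
    return index

def harvest_objects(sentence_labels, index, per_label: int = 1):
    """Collect features for every object word in the sentence, even if no
    single image contains them all; the resulting mixture becomes the
    transformer's input sequence."""
    mixture = []
    for label in sentence_labels:
        pool = index[label]
        if pool:
            mixture += random.sample(pool, min(per_label, len(pool)))
    return mixture
```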
arXiv Detail & Related papers (2021-12-02T03:56:09Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training (FILIP) achieves finer-grained alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
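The late interaction mechanism matches tokens across modalities before pooling: each token takes the maximum similarity over the other modality's tokens, and those maxima are averaged. The sketch below follows that recipe with illustrative shapes; FILIP uses the two directions separately in its contrastive loss, while they are averaged here for brevity.

```python
import torch
import torch.nn.functional as F

def filip_similarity(img_tokens, txt_tokens):
    # img_tokens: (B, Ni, D) patch embeddings; txt_tokens: (B, Nt, D)
    img = F.normalize(img_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    sim = torch.einsum("bid,bjd->bij", img, txt)  # (B, Ni, Nt)
    # Each image patch is matched to its most similar word, and vice versa,
    # giving a finer-grained alignment than a single global dot product.
    i2t = sim.max(dim=2).values.mean(dim=1)  # (B,)
    t2i = sim.max(dim=1).values.mean(dim=1)  # (B,)
    return (i2t + t2i) / 2
```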
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Contrastive Learning for Unsupervised Image-to-Image Translation [10.091669091440396]
We propose an unsupervised image-to-image translation method based on contrastive learning.
We randomly sample a pair of images and train the generator to change the appearance of one toward the other while keeping the original structure.
Experimental results show that our method outperforms the leading unsupervised baselines in terms of visual quality and translation accuracy.
arXiv Detail & Related papers (2021-05-07T08:43:38Z)
- CAPTION: Correction by Analyses, POS-Tagging and Interpretation of Objects using only Nouns [1.4502611532302039]
This work proposes a combination of deep learning methods for object detection and natural language processing to validate image captions.
We test our method on the FOIL-COCO data set, since it provides correct and incorrect captions for various images using only objects represented in the MS-COCO image data set.
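A minimal sketch of the validation idea, assuming the detector returns a set of label strings; the NLTK calls are real, but whether CAPTION requires every noun to be detected or scores mismatches more softly is an assumption here.

```python
# Requires the NLTK "punkt" and "averaged_perceptron_tagger" resources.
import nltk

def caption_is_consistent(caption: str, detected_labels: set) -> bool:
    tokens = nltk.word_tokenize(caption.lower())
    # Keep only nouns (POS tags starting with "NN": NN, NNS, ...).
    nouns = {w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")}
    # The caption passes only if every noun is backed by a detection.
    return nouns.issubset(detected_labels)

# e.g. caption_is_consistent("a dog on a couch", {"dog", "couch"}) -> True
```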
arXiv Detail & Related papers (2020-10-02T08:06:42Z)
- Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation [55.198596946371126]
We propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching.
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
The design of such score functions removes the need of object detection at test time, thereby significantly reducing the inference cost.
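A sketch of the two-level score design: the region-phrase score is computed first, and the image-sentence score is built on top of it. The max-then-mean aggregation below is an assumption chosen for illustration, not necessarily the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def region_phrase_scores(region_feats, phrase_feats):
    # region_feats: (R, D) region embeddings; phrase_feats: (P, D)
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    return p @ r.t()  # (P, R): score of each phrase against each region

def image_sentence_score(region_feats, phrase_feats):
    s = region_phrase_scores(region_feats, phrase_feats)
    # Each phrase is grounded to its best region; averaging over phrases
    # yields an image-sentence score usable for contrastive training.
    return s.max(dim=1).values.mean()
```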
arXiv Detail & Related papers (2020-07-03T22:02:00Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
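A sketch of how a bag-of-visual-words target can be built, assuming a fixed feature extractor and a k-means vocabulary (both standard ingredients, though the exact recipe here is illustrative):

```python
import torch
import torch.nn.functional as F

def bow_target(feature_map, vocabulary):
    # feature_map: (C, H, W) from a fixed "teacher" network;
    # vocabulary: (K, C) visual words, e.g. from k-means over teacher features.
    feats = F.normalize(feature_map.flatten(1).t(), dim=-1)  # (H*W, C)
    words = F.normalize(vocabulary, dim=-1)                  # (K, C)
    assign = (feats @ words.t()).argmax(dim=1)               # (H*W,) word ids
    hist = torch.bincount(assign, minlength=vocabulary.size(0)).float()
    return hist / hist.sum()  # normalized bag-of-words distribution

# A student network is then trained to predict this distribution from a
# perturbed view of the image (e.g. with a softmax head and cross-entropy).
```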
arXiv Detail & Related papers (2020-02-27T16:45:25Z)