Object-Centric Unsupervised Image Captioning
- URL: http://arxiv.org/abs/2112.00969v1
- Date: Thu, 2 Dec 2021 03:56:09 GMT
- Title: Object-Centric Unsupervised Image Captioning
- Authors: Zihang Meng, David Yang, Xuefei Cao, Ashish Shah, Ser-Nam Lim
- Abstract summary: In the supervised setting, image-caption pairs are "well-matched", where all objects mentioned in the sentence appear in the corresponding image.
Our work in this paper overcomes this by harvesting objects corresponding to a given sentence from the training set, even if they don't belong to the same image.
When used as input to a transformer, such a mixture of objects enables larger, if not full, object coverage.
- Score: 19.59302443472258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training an image captioning model in an unsupervised manner without
utilizing annotated image-caption pairs is an important step towards tapping
into a wider corpus of text and images. In the supervised setting,
image-caption pairs are "well-matched", where all objects mentioned in the
sentence appear in the corresponding image. These pairings are, however, not
available in the unsupervised setting. To overcome this, a main school of
research, which has been shown to be effective, constructs pairs from the
images and texts in the training set according to their overlap of objects.
Unlike in the supervised setting, however, these constructed pairings are not
guaranteed to have a fully overlapping set of objects. Our work in this paper
overcomes this by harvesting the objects corresponding to a given sentence
from the training set, even if they don't belong to the same image. When used
as input to a transformer, such a mixture of objects enables larger, if not
full, object coverage, and when supervised by the corresponding sentence,
produces results that outperform current state-of-the-art unsupervised methods
by a significant margin. Building upon this finding, we further show that (1)
additional information on the relationships between objects and the attributes
of objects also helps boost performance; and (2) our method also extends well
to non-English image captioning, which usually suffers from scarcer
annotations. Our findings are supported by strong empirical results.
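As a rough illustration of the harvesting idea above, the sketch below (ours, with hypothetical helper names; the paper's actual pipeline may differ) indexes detected objects across the whole unpaired image set and then assembles, for each sentence, a feature set covering the objects it mentions:

```python
from collections import defaultdict

def build_object_index(image_detections):
    """Index region features by object label across the whole
    (unpaired) training image set. `image_detections` is a list,
    one entry per image, of (label, feature) pairs."""
    index = defaultdict(list)
    for detections in image_detections:
        for label, feature in detections:
            index[label].append(feature)
    return index

def harvest_objects(sentence_objects, index):
    """For each object word in a sentence, take a region feature
    from *any* image containing that object, so coverage is not
    limited to a single constructed image-caption pairing."""
    harvested = []
    for label in sentence_objects:
        if index.get(label):
            harvested.append(index[label][0])  # e.g. top-scoring detection
    return harvested
```

The harvested features would then serve as transformer input, with the sentence itself providing the supervision signal, which is what lets object coverage approach that of the fully matched supervised setting.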
Related papers
- Towards Image Semantics and Syntax Sequence Learning [8.033697392628424]
We introduce the concept of "image grammar", consisting of "image semantics" and "image syntax".
We propose a weakly supervised two-stage approach to learn the image grammar relative to a class of visual objects/scenes.
Our framework is trained to reason over patch semantics and detect faulty syntax.
arXiv Detail & Related papers (2024-01-31T00:16:02Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning (background mixup is sketched below).
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
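One plausible reading of the background-mixup augmentation mentioned above, sketched under the assumption that a soft object mask is already available (the paper derives it from ContraCAM; here it is simply an input):

```python
import numpy as np

def background_mixup(image, other_image, object_mask, alpha=0.5):
    """Keep the object region of `image` and blend only the
    background with `other_image`, weakening background cues.
    `object_mask` is an HxW array in [0, 1], 1 on the object."""
    mask = object_mask[..., None]  # broadcast over channels
    blended_bg = alpha * image + (1.0 - alpha) * other_image
    return mask * image + (1.0 - mask) * blended_bg
```

Blending only the background leaves the object intact while breaking the spurious correlation between objects and their typical scenes.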
- Exploring Set Similarity for Dense Self-supervised Representation Learning [96.35286140203407]
We propose to explore set similarity (SetSim) for dense self-supervised representation learning.
We generalize pixel-wise similarity learning to set-wise similarity to improve robustness, because sets contain more semantic and structural information.
Specifically, by resorting to the attentional features of views, we establish corresponding sets, thus filtering out noisy backgrounds that may cause incorrect correspondences (a minimal sketch follows this entry).
arXiv Detail & Related papers (2021-07-19T09:38:27Z)
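A minimal sketch of set-wise similarity in the spirit of SetSim; the saliency-based selection below is our stand-in for the paper's attentional feature selection:

```python
import torch
import torch.nn.functional as F

def set_similarity(feats_a, feats_b, keep_ratio=0.5):
    """Set-wise similarity between two augmented views.
    feats_*: (N, D) spatial features of each view. We keep the
    most salient features (by norm, a saliency proxy) as each
    view's 'set', then average pairwise cosine similarity."""
    def select(feats):
        k = max(1, int(keep_ratio * feats.size(0)))
        idx = feats.norm(dim=1).topk(k).indices
        return F.normalize(feats[idx], dim=1)
    a, b = select(feats_a), select(feats_b)
    return (a @ b.t()).mean()  # mean pairwise cosine similarity
```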
- Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning [37.14912430046118]
Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs.
We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions (a minimal sketch follows this entry).
arXiv Detail & Related papers (2021-04-28T16:36:52Z)
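A bare-bones sketch of word-level gating as described above; the gate's parameterization here is an assumption, not the paper's exact design:

```python
import torch
import torch.nn as nn

class WordGate(nn.Module):
    """Down-weight unreliable pseudo-caption words. The gate is a
    learned scalar in (0, 1) per word, conditioned on the image
    feature and each word feature."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, image_feat, word_feats):
        # image_feat: (D,), word_feats: (T, D) -> gates: (T,)
        img = image_feat.expand_as(word_feats)
        gates = torch.sigmoid(self.score(torch.cat([img, word_feats], dim=-1)))
        return gates.squeeze(-1)
```

The resulting per-word weights would multiply the per-word captioning losses, so unreliable pseudo-caption words contribute little gradient.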
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- A Self-supervised GAN for Unsupervised Few-shot Object Recognition [39.79912546252623]
This paper addresses unsupervised few-shot object recognition.
All training images are unlabeled, and test images are divided into queries and a few labeled support images per object class of interest.
We extend the vanilla GAN with two loss functions, both aimed at self-supervised learning.
arXiv Detail & Related papers (2020-08-16T19:47:26Z)
- Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation [55.198596946371126]
We propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching.
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
The design of such score functions removes the need for object detection at test time, thereby significantly reducing the inference cost (see the sketch after this entry).
arXiv Detail & Related papers (2020-07-03T22:02:00Z)
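A sketch of how an image-sentence score can be composed from region-phrase scores as described above; the max-over-regions, mean-over-phrases aggregation is our assumption:

```python
import torch
import torch.nn.functional as F

def region_phrase_scores(region_feats, phrase_feats):
    """Cosine similarity between every region and every phrase.
    region_feats: (R, D), phrase_feats: (P, D) -> (R, P) scores."""
    r = F.normalize(region_feats, dim=1)
    p = F.normalize(phrase_feats, dim=1)
    return r @ p.t()

def image_sentence_score(region_feats, phrase_feats):
    """Aggregate region-phrase scores into one image-sentence
    score: best region per phrase, averaged over phrases."""
    scores = region_phrase_scores(region_feats, phrase_feats)
    return scores.max(dim=0).values.mean()
```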
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets a new state of the art in all these settings, demonstrating its efficacy and generalizability (see the sketch after this entry).
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
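A bare-bones co-attention sketch for the cross-image idea above (our simplification): an affinity matrix between two images' feature maps lets each image pool context from the other.

```python
import torch

def co_attention(feats_a, feats_b):
    """feats_*: (N, D) flattened spatial features of two images.
    Each location in A attends over all locations in B (and vice
    versa), pooling cross-image context for shared semantics."""
    affinity = feats_a @ feats_b.t()                  # (N_a, N_b)
    ctx_a = torch.softmax(affinity, dim=1) @ feats_b  # B-context for A
    ctx_b = torch.softmax(affinity.t(), dim=1) @ feats_a
    return ctx_a, ctx_b
```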
This list is automatically generated from the titles and abstracts of the papers on this site.