Contrastive Learning for Weakly Supervised Phrase Grounding
- URL: http://arxiv.org/abs/2006.09920v3
- Date: Wed, 5 Aug 2020 21:53:38 GMT
- Title: Contrastive Learning for Weakly Supervised Phrase Grounding
- Authors: Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and
Derek Hoiem
- Abstract summary: We show that phrase grounding can be learned by optimizing word-region attention.
A key idea is to construct effective negative captions for learning through language model guided word substitutions.
Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on the Flickr30K Entities benchmark.
- Score: 99.73968052506206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Phrase grounding, the problem of associating image regions to caption words,
is a crucial component of vision-language tasks. We show that phrase grounding
can be learned by optimizing word-region attention to maximize a lower bound on
mutual information between images and caption words. Given pairs of images and
captions, we maximize compatibility of the attention-weighted regions and the
words in the corresponding caption, compared to non-corresponding pairs of
images and captions. A key idea is to construct effective negative captions for
learning through language model guided word substitutions. Training with our
negatives yields a $\sim10\%$ absolute gain in accuracy over randomly-sampled
negatives from the training data. Our weakly supervised phrase grounding model
trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$
accuracy on Flickr30K Entities benchmark.
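Below is a minimal PyTorch sketch of the two ideas the abstract describes: an InfoNCE-style contrastive objective over attention-weighted region features, and negative captions built by language-model-guided word substitution. All function and variable names (`word_region_infonce`, `region_feats`, `negative_captions`, and so on) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def word_region_infonce(region_feats, word_feats, neg_word_feats, temperature=0.07):
    """Contrastive (InfoNCE-style) lower bound on image/caption mutual information,
    computed through word-region attention. Sketch only, not the authors' code.

    region_feats:   (B, R, D) region features per image
    word_feats:     (B, W, D) word features of the matching caption
    neg_word_feats: (B, N, W, D) word features of N negative captions per image
    """
    d = region_feats.size(-1)

    def caption_score(regions, words):
        # Each word attends over the image regions ...
        attn = torch.softmax(words @ regions.transpose(-1, -2) / d ** 0.5, dim=-1)
        attended = attn @ regions  # attention-weighted region feature per word
        # ... and its compatibility with the attended region (dot product) is
        # averaged over words to give a caption-level score.
        return (words * attended).sum(-1).mean(-1)

    pos = caption_score(region_feats, word_feats)                   # (B,)
    neg = caption_score(region_feats.unsqueeze(1), neg_word_feats)  # (B, N)

    # True image-caption pair must out-score the negative captions.
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

The negative captions contrasted against the true pair can be obtained by masking one word and asking a masked language model for plausible substitutes, as the abstract describes. A hedged sketch using the Hugging Face `fill-mask` pipeline (the specific model and helper are assumptions for illustration, not the paper's exact procedure):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def negative_captions(words, position, top_k=5):
    """Swap the word at `position` for language-model-suggested alternatives."""
    masked = list(words)
    original = masked[position]
    masked[position] = fill_mask.tokenizer.mask_token
    candidates = fill_mask(" ".join(masked), top_k=top_k + 1)
    substitutes = [c["token_str"] for c in candidates
                   if c["token_str"].lower() != original.lower()]
    return [" ".join(list(words[:position]) + [s] + list(words[position + 1:]))
            for s in substitutes[:top_k]]
```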
Related papers
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- Large-Scale Bidirectional Training for Zero-Shot Image Captioning [44.17587735943739]
We introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning.
We show that careful selection of the large-scale training set and model architecture is key to achieving zero-shot image captioning.
arXiv Detail & Related papers (2022-11-13T00:09:36Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Distributed Attention for Grounded Image Captioning [55.752968732796354]
We study the problem of weakly supervised grounded image captioning.
The goal is to automatically generate a sentence describing the context of the image with each noun word grounded to the corresponding region in the image.
arXiv Detail & Related papers (2021-08-02T17:28:33Z)
- Visually Grounded Compound PCFGs [65.04669567781634]
Exploiting visual groundings for language understanding has recently been drawing much attention.
We study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual captions.
arXiv Detail & Related papers (2020-09-25T19:07:00Z)
- Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation [55.198596946371126]
We propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching.
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
The design of such score functions removes the need for object detection at test time, thereby significantly reducing the inference cost (a minimal sketch of this two-level scoring appears after this list).
arXiv Detail & Related papers (2020-07-03T22:02:00Z)
- More Grounded Image Captioning by Distilling Image-Text Matching Model [56.79895670335411]
We propose a Part-of-Speech (POS) enhanced image-text matching model (SCAN) as the effective knowledge distillation for more grounded image captioning.
The benefits are two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularization for the captioner's visual attention module.
arXiv Detail & Related papers (2020-04-01T12:42:06Z)
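The "Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation" entry above constructs an image-sentence score from a region-phrase score function. A minimal sketch of that two-level scoring idea, assuming max-over-regions then mean-over-phrases aggregation (the names and the exact aggregation are illustrative, not taken from that paper):

```python
import torch
import torch.nn.functional as F

def region_phrase_scores(region_feats, phrase_feats):
    """Cosine similarity between every region (R, D) and every phrase (P, D) -> (R, P)."""
    return F.normalize(region_feats, dim=-1) @ F.normalize(phrase_feats, dim=-1).T

def image_sentence_score(region_feats, phrase_feats):
    """Aggregate region-phrase scores into one image-sentence score:
    each phrase keeps its best-matching region, then scores are averaged over phrases.
    Grounding a phrase at test time is an argmax over the same score table,
    so no separate object detector is needed."""
    scores = region_phrase_scores(region_feats, phrase_feats)  # (R, P)
    return scores.max(dim=0).values.mean()
```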
This list is automatically generated from the titles and abstracts of the papers on this site.