Improving Weakly Supervised Visual Grounding by Contrastive Knowledge
Distillation
- URL: http://arxiv.org/abs/2007.01951v2
- Date: Sun, 25 Apr 2021 05:11:11 GMT
- Title: Improving Weakly Supervised Visual Grounding by Contrastive Knowledge
Distillation
- Authors: Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu
- Abstract summary: We propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching.
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
The design of such score functions removes the need for object detection at test time, thereby significantly reducing the inference cost.
- Score: 55.198596946371126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised phrase grounding aims at learning region-phrase
correspondences using only image-sentence pairs. A major challenge thus lies in
the missing links between image regions and sentence phrases during training.
To address this challenge, we leverage a generic object detector at training
time, and propose a contrastive learning framework that accounts for both
region-phrase and image-sentence matching. Our core innovation is the learning
of a region-phrase score function, based on which an image-sentence score
function is further constructed. Importantly, our region-phrase score function
is learned by distilling from soft matching scores between the detected object
names and candidate phrases within an image-sentence pair, while the
image-sentence score function is supervised by ground-truth image-sentence
pairs. The design of such score functions removes the need for object detection
at test time, thereby significantly reducing the inference cost. Without bells
and whistles, our approach achieves state-of-the-art results on visual phrase
grounding, surpassing previous methods that require expensive object detectors
at test time.
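The scoring scheme described in the abstract can be sketched in a few lines. The sketch below is illustrative only, not the paper's implementation: the aggregation (best region per phrase, averaged over phrases), the temperature, and all function names are assumptions, and the "teacher" soft matches stand in for the similarities between detected object names and candidate phrases that the paper distills from.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def region_phrase_scores(region_feats, phrase_feats):
    # Cosine similarities between region and phrase embeddings.
    # Returns a (num_regions, num_phrases) score matrix.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    p = phrase_feats / np.linalg.norm(phrase_feats, axis=1, keepdims=True)
    return r @ p.T

def image_sentence_score(rp_scores):
    # One way to build the image-sentence score from region-phrase scores:
    # best-matching region per phrase, averaged over phrases (assumption).
    return rp_scores.max(axis=0).mean()

def distillation_loss(rp_scores, teacher_soft, tau=1.0):
    # Cross-entropy between teacher soft matches over regions (e.g. derived
    # from detected-object-name / phrase similarity) and the student's
    # softmax over regions, averaged over phrases.
    log_probs = log_softmax(rp_scores / tau, axis=0)
    return float(-(teacher_soft * log_probs).sum(axis=0).mean())

def image_sentence_contrastive_loss(batch_scores):
    # Symmetric InfoNCE over a batch: batch_scores[i, j] scores image i
    # against sentence j; ground-truth pairs lie on the diagonal.
    loss_i2s = -np.diag(log_softmax(batch_scores, axis=1)).mean()
    loss_s2i = -np.diag(log_softmax(batch_scores, axis=0)).mean()
    return float(loss_i2s + loss_s2i)
```

At test time only `region_phrase_scores` over precomputed region features is needed to rank regions for a phrase, which is why no detector is run during inference.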
Related papers
- Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- Learning to search for and detect objects in foveal images using deep learning [3.655021726150368]
This study employs a fixation prediction model that emulates human objective-guided attention when searching for a given class in an image.
The foveated pictures at each fixation point are then classified to determine whether the target is present or absent in the scene.
We present a novel dual task model capable of performing fixation prediction and detection simultaneously, allowing knowledge transfer between the two tasks.
arXiv Detail & Related papers (2023-04-12T09:50:25Z)
- Distributed Attention for Grounded Image Captioning [55.752968732796354]
We study the problem of weakly supervised grounded image captioning.
The goal is to automatically generate a sentence describing the context of the image with each noun word grounded to the corresponding region in the image.
arXiv Detail & Related papers (2021-08-02T17:28:33Z)
- Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning [37.14912430046118]
Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs.
We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions.
arXiv Detail & Related papers (2021-04-28T16:36:52Z)
- Detector-Free Weakly Supervised Grounding by Separation [76.65699170882036]
Weakly Supervised phrase-Grounding (WSG) deals with the task of using data to learn to localize arbitrary text phrases in images.
We propose Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector.
We demonstrate a significant accuracy improvement of up to 8.5% over the previous DF-WSG state of the art.
arXiv Detail & Related papers (2021-04-20T08:27:31Z)
- A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection [56.82077636126353]
We take advantage of object-centric images to improve object detection in scene-centric images.
We present a simple yet surprisingly effective framework to do so.
Our approach can improve the object detection (and instance segmentation) accuracy of rare objects by 50% (and 33%) in relative terms.
arXiv Detail & Related papers (2021-02-17T17:27:21Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets a new state of the art across all these settings, demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
- Contrastive Learning for Weakly Supervised Phrase Grounding [99.73968052506206]
We show that phrase grounding can be learned by optimizing word-region attention.
A key idea is to construct effective negative captions for learning through language model guided word substitutions.
Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7%, reaching 76.7% accuracy on the Flickr30K benchmark.
arXiv Detail & Related papers (2020-06-17T15:00:53Z)
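The negative-caption idea in the last entry can be illustrated with a toy sketch. Note the assumptions: the paper uses a language model to propose word substitutions, whereas here a hand-made dictionary of noun alternatives stands in, and the function name and interface are hypothetical.

```python
import random

def make_negative_caption(caption, noun_alternatives, rng=random):
    # Toy negative-caption construction: swap one swappable noun for a
    # plausible alternative, producing a caption that no longer matches
    # the image. Returns None if no word in the caption can be swapped.
    words = caption.split()
    swappable = [i for i, w in enumerate(words) if w in noun_alternatives]
    if not swappable:
        return None
    i = rng.choice(swappable)
    words[i] = rng.choice(noun_alternatives[words[i]])
    return " ".join(words)
```

Contrasting the original caption against such minimally edited negatives forces the word-region attention to depend on the substituted word, which is the intuition behind the paper's language-model-guided substitutions.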
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information (including all listed content) and is not responsible for any consequences of its use.