Relation-aware Instance Refinement for Weakly Supervised Visual Grounding
- URL: http://arxiv.org/abs/2103.12989v1
- Date: Wed, 24 Mar 2021 05:03:54 GMT
- Title: Relation-aware Instance Refinement for Weakly Supervised Visual Grounding
- Authors: Yongfei Liu, Bo Wan, Lin Ma, Xuming He
- Abstract summary: Visual grounding aims to build a correspondence between visual objects and their language entities.
We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling.
Experiments on two public benchmarks demonstrate the efficacy of our framework.
- Score: 44.33411132188231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding, which aims to build a correspondence between visual objects
and their language entities, plays a key role in cross-modal scene
understanding. One promising and scalable strategy for learning visual
grounding is to utilize weak supervision from only image-caption pairs.
Previous methods typically rely on matching query phrases directly to a
precomputed, fixed object candidate pool, which leads to inaccurate
localization and ambiguous matching due to the lack of semantic relation
constraints.
In this paper, we propose a novel context-aware weakly-supervised learning
method that incorporates coarse-to-fine object refinement and entity relation
modeling into a two-stage deep network, capable of producing more accurate
object representation and matching. To effectively train our network, we
introduce a self-taught regression loss for the proposal locations and a
classification loss based on parsed entity relations.
Extensive experiments on two public benchmarks, Flickr30K Entities and
ReferItGame, demonstrate the efficacy of our weakly-supervised grounding
framework. The results show that we outperform previous methods by a
considerable margin, achieving 59.27% top-1 accuracy on Flickr30K Entities and
37.68% on ReferItGame, respectively (Code is available at
https://github.com/youngfly11/ReIR-WeaklyGrounding.pytorch.git).
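The weak-supervision signal described above — learning phrase-region correspondence from image-caption pairs alone, with no box annotations — is easiest to see in code. Below is a minimal sketch of the generic phrase-region attention and ranking objective that this line of work builds on; the tensor shapes, max-pooling choice, and in-batch negatives are illustrative assumptions, not the authors' exact formulation, which further adds the self-taught regression loss and relation-based classification loss described in the abstract.

```python
import torch
import torch.nn.functional as F

def phrase_region_matching_loss(region_feats, phrase_feats, margin=0.1):
    """Sketch of a generic weakly-supervised grounding objective (not the
    paper's exact loss): phrases attend over region proposals, attended
    similarities are pooled into an image-caption score, and a margin
    ranking loss pushes matched pairs above in-batch mismatched ones.

    region_feats: (B, R, D) proposal features for B images, R proposals each
    phrase_feats: (B, P, D) embeddings for P query phrases per caption
    """
    # Cosine similarity between every phrase and every region: (B, B, P, R)
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    sim = torch.einsum('ipd,jrd->ijpr', p, r)

    # Each phrase selects its best-matching region (max over regions), then
    # phrase scores are averaged into an image-caption score: (B, B)
    scores = sim.max(dim=-1).values.mean(dim=-1)

    # Margin ranking: the matched (diagonal) caption-image pair should beat
    # every mismatched pair in the batch
    pos = scores.diag().unsqueeze(1)  # (B, 1)
    mask = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return F.relu(margin + scores - pos)[mask].mean()

# Illustrative shapes: 4 caption-image pairs, 36 proposals, 5 phrases, 512-d
regions = torch.randn(4, 36, 512)
phrases = torch.randn(4, 5, 512)
print(phrase_region_matching_loss(regions, phrases))
```

Note that training only requires matched (image, caption) pairs: the diagonal of the score matrix supplies the positives, and every other caption in the batch acts as a negative, which is what makes the setup weakly supervised.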
Related papers
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z) - GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language
Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses visual grounding ability from existing models trained on image-text pairs and pure object detection data.
We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z) - CoDo: Contrastive Learning with Downstream Background Invariance for
Detection [10.608660802917214]
We propose a novel object-level self-supervised learning method, called Contrastive learning with Downstream background invariance (CoDo).
The pretext task is converted to focus on instance location modeling for various backgrounds, especially for downstream datasets.
Experiments on MSCOCO demonstrate that the proposed CoDo, with a common ResNet50-FPN backbone, yields strong transfer learning results for object detection.
arXiv Detail & Related papers (2022-05-10T01:26:15Z) - Unpaired Referring Expression Grounding via Bidirectional Cross-Modal
Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z) - Weakly-Supervised Video Object Grounding via Causal Intervention [82.68192973503119]
We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning.
It aims to localize objects described in the sentence to visual regions in the video, which is a fundamental capability needed in pattern analysis and machine learning.
arXiv Detail & Related papers (2021-12-01T13:13:03Z) - Unsupervised Part Discovery from Contrastive Reconstruction [90.88501867321573]
The goal of self-supervised visual representation learning is to learn strong, transferable image representations.
We propose an unsupervised approach to object part discovery and segmentation.
Our method yields semantic parts consistent across fine-grained but visually distinct categories.
arXiv Detail & Related papers (2021-11-11T17:59:42Z) - Co-Grounding Networks with Semantic Attention for Referring Expression
Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z) - InstanceRefer: Cooperative Holistic Understanding for Visual Grounding
on Point Clouds through Instance Multi-level Contextual Referring [38.13420293700949]
We propose a new model, named InstanceRefer, to achieve a superior 3D visual grounding on point clouds.
Our model first filters instances from panoptic segmentation on point clouds to obtain a small number of candidates.
Experiments confirm that our InstanceRefer outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-01T16:59:27Z) - Pairwise Similarity Knowledge Transfer for Weakly Supervised Object
Localization [53.99850033746663]
We study the problem of learning a localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of a pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z)