Unpaired Referring Expression Grounding via Bidirectional Cross-Modal
Matching
- URL: http://arxiv.org/abs/2201.06686v1
- Date: Tue, 18 Jan 2022 01:13:19 GMT
- Title: Unpaired Referring Expression Grounding via Bidirectional Cross-Modal
Matching
- Authors: Hengcan Shi, Munawar Hayat, Jianfei Cai
- Abstract summary: Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
- Score: 53.27673119360868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring expression grounding is an important and challenging task in
computer vision. To avoid the laborious annotation in conventional referring
grounding, unpaired referring grounding is introduced, where the training data
only contains a number of images and queries without correspondences. The few
existing solutions to unpaired referring grounding are still preliminary, due
to the challenges of learning image-text matching and the lack of top-down
guidance with unpaired data. In this paper, we propose a novel bidirectional
cross-modal matching (BiCM) framework to address these challenges.
Particularly, we design a query-aware attention map (QAM) module that
introduces a top-down perspective by generating query-specific visual attention
maps. A cross-modal object matching (COM) module is further introduced, which
exploits the recently emerged image-text matching pretrained model, CLIP, to
predict the target objects from a bottom-up perspective. The top-down and
bottom-up predictions are then integrated via a similarity fusion (SF) module.
We also propose a knowledge adaptation matching (KAM) module that leverages
unpaired training data to adapt pretrained knowledge to the target dataset and
task. Experiments show that our framework outperforms previous works by 6.55%
and 9.94% on two popular grounding datasets.
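The bottom-up cross-modal object matching (COM) idea is easy to prototype with the publicly released CLIP model. The sketch below, a minimal illustration rather than the authors' implementation, scores candidate object crops against the query text with CLIP and combines the result with a top-down score through a simple weighted fusion; the function names, the source of candidate boxes, and the fusion weight `alpha` are assumptions standing in for the paper's QAM and SF modules.

```python
# Minimal sketch of CLIP-based bottom-up matching plus a toy score fusion.
# Assumes the OpenAI `clip` package (https://github.com/openai/CLIP) and PyTorch.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def com_scores(crops, query):
    """CLIP similarity between candidate object crops (list of PIL images) and the query text."""
    images = torch.stack([preprocess(c) for c in crops]).to(device)
    text = clip.tokenize([query]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(images)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(-1)  # shape: (num_crops,)

def fuse(topdown_scores, bottomup_scores, alpha=0.5):
    """Toy similarity fusion: convex combination of top-down and bottom-up scores."""
    return alpha * topdown_scores + (1.0 - alpha) * bottomup_scores

# Usage (hypothetical candidate boxes from any detector or proposal method):
# crops = [image.crop(box) for box in candidate_boxes]
# best = fuse(topdown_scores, com_scores(crops, "the man in the red shirt")).argmax()
```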
Related papers
- Decoupling the Class Label and the Target Concept in Machine Unlearning [81.69857244976123]
Machine unlearning aims to adjust a trained model to approximate a retrained one that excludes a portion of training data.
Previous studies showed that class-wise unlearning is successful in forgetting the knowledge of a target class.
We propose a general framework, namely TARget-aware Forgetting (TARF).
arXiv Detail & Related papers (2024-06-12T14:53:30Z)
- Learning Cross-view Visual Geo-localization without Ground Truth [48.51859322439286]
Cross-View Geo-Localization (CVGL) involves determining the geographical location of a query image by matching it with a corresponding GPS-tagged reference image.
Current state-of-the-art methods rely on training models with labeled paired images, incurring substantial annotation costs and training burdens.
We investigate the adaptation of frozen models for CVGL without requiring ground truth pair labels.
arXiv Detail & Related papers (2024-03-19T13:01:57Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Causal Scene BERT: Improving object detection by searching for challenging groups of data [125.40669814080047]
Computer vision applications rely on learning-based perception modules parameterized with neural networks for tasks like object detection.
These modules frequently have low expected error overall but high error on atypical groups of data due to biases inherent in the training process.
Our main contribution is a pseudo-automatic method to discover such groups in foresight by performing causal interventions on simulated scenes.
arXiv Detail & Related papers (2022-02-08T05:14:16Z)
- Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics [6.678312249123534]
We aim to boost end-to-end models with object-guided statistical priors.
We propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy.
Combined, the above modules compose the Object-guided Cross-modal Network (OCN).
arXiv Detail & Related papers (2022-02-01T07:39:04Z)
- Relation-aware Instance Refinement for Weakly Supervised Visual Grounding [44.33411132188231]
Visual grounding aims to build a correspondence between visual objects and their language entities.
We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling.
Experiments on two public benchmarks demonstrate the efficacy of our framework.
arXiv Detail & Related papers (2021-03-24T05:03:54Z)
- Referring Expression Comprehension: A Survey of Methods and Datasets [20.42495629501261]
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language.
We first examine the state of the art by comparing modern approaches to the problem.
We discuss modular architectures and graph-based models that interface with structured graph representations.
arXiv Detail & Related papers (2020-07-19T01:45:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.