Iterative Robust Visual Grounding with Masked Reference based
Centerpoint Supervision
- URL: http://arxiv.org/abs/2307.12392v1
- Date: Sun, 23 Jul 2023 17:55:24 GMT
- Title: Iterative Robust Visual Grounding with Masked Reference based
Centerpoint Supervision
- Authors: Menghao Li, Chunlei Wang, Wenquan Feng, Shuchang Lyu, Guangliang
Cheng, Xiangtai Li, Binghao Liu, Qi Zhao
- Abstract summary: We propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS).
The proposed framework is evaluated on five regular VG datasets and two newly constructed robust VG datasets.
- Score: 24.90534567531536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Grounding (VG) aims at localizing target objects from an image based
on given expressions and has made significant progress with the development of
detection and vision transformers. However, existing VG methods tend to generate
false-alarm objects when presented with inaccurate or irrelevant descriptions,
which commonly occur in practical applications. Moreover, existing methods fail
to capture fine-grained features, localize targets accurately, and fully
comprehend the context of the whole image and the textual description. To address both
issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with
Masked Reference based Centerpoint Supervision (MRCS). The framework introduces
iterative multi-level vision-language fusion (IMVF) for better alignment. We
use MRCS to achieve more accurate localization with point-wise feature
supervision. Then, to improve the robustness of VG, we also present a
multi-stage false-alarm sensitive decoder (MFSD) to prevent the generation of
false-alarm objects when presented with inaccurate expressions. The proposed
framework is evaluated on five regular VG datasets and two newly constructed
robust VG datasets. Extensive experiments demonstrate that IR-VG achieves new
state-of-the-art (SOTA) results, with improvements of 25% and 10% compared to
existing SOTA approaches on the two newly proposed robust VG datasets.
Moreover, the proposed framework is also shown to be effective on five regular VG
datasets. Code and models will be made publicly available at
https://github.com/cv516Buaa/IR-VG.
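
As a concrete illustration of what point-wise centerpoint supervision and false-alarm rejection can look like, below is a minimal, hypothetical PyTorch sketch. It is not the paper's released code: it assumes a CenterNet-style Gaussian heatmap target for the referred object's center, a penalty-reduced focal loss, and a simple confidence gate that returns no prediction when the peak score is low. All function names, thresholds, and tensor shapes are illustrative assumptions.

```python
# Hypothetical sketch (not the IR-VG release): Gaussian centerpoint targets,
# a penalty-reduced focal loss on the predicted heatmap, and a crude
# low-confidence gate standing in for false-alarm filtering.
import torch


def gaussian_center_heatmap(h, w, cx, cy, sigma=2.0, device="cpu"):
    """Render a single-peak Gaussian heatmap centered on the referred object."""
    ys = torch.arange(h, device=device).view(-1, 1).float()
    xs = torch.arange(w, device=device).view(1, -1).float()
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))


def centerpoint_loss(pred, gt, alpha=2.0, beta=4.0):
    """Penalty-reduced focal loss (CenterNet-style) on the centerpoint heatmap."""
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos = gt.eq(1).float()  # exact center pixel(s)
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos


def predict_or_reject(pred, score_thresh=0.3):
    """Return the peak (row, col), or None when the expression seems to refer
    to nothing in the image (illustrating the idea of rejecting false alarms)."""
    score, idx = pred.flatten().max(dim=0)
    if score.item() < score_thresh:
        return None
    return divmod(idx.item(), pred.shape[1])


# Toy usage: supervise a 32x32 heatmap whose ground-truth center is cx=20, cy=11.
pred = torch.rand(32, 32)
gt = gaussian_center_heatmap(32, 32, cx=20.0, cy=11.0)
print(centerpoint_loss(pred, gt).item(), predict_or_reject(pred))
```

In IR-VG itself, the point-wise supervision is derived from masked references (MRCS) and false alarms are handled by the multi-stage false-alarm sensitive decoder (MFSD); the gate above only conveys the general idea of abstaining on low-confidence predictions.
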
Related papers
- DVF: Advancing Robust and Accurate Fine-Grained Image Retrieval with Retrieval Guidelines [67.44394651662738]
Fine-grained image retrieval (FGIR) aims to learn visual representations that distinguish visually similar objects while maintaining generalization.
Existing methods propose to generate discriminative features, but rarely consider the particularity of the FGIR task itself.
This paper proposes practical guidelines to identify subcategory-specific discrepancies and generate discriminative features to design effective FGIR models.
arXiv Detail & Related papers (2024-04-24T09:45:12Z)
- A Bayesian Approach to OOD Robustness in Image Classification [20.104489420303306]
We introduce a novel Bayesian approach to OOD robustness for object classification.
We exploit the fact that CompNets contain a generative head defined over feature vectors represented by von Mises-Fisher (vMF) kernels.
This enables us to learn a transitional dictionary of vMF kernels that are intermediate between the source and target domains.
arXiv Detail & Related papers (2024-03-12T03:15:08Z) - Diffusion-Based Particle-DETR for BEV Perception [94.88305708174796]
Bird-Eye-View (BEV) is one of the most widely-used scene representations for visual perception in Autonomous Vehicles (AVs).
Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively detect small objects in the large coverage of the BEV.
Here, we address this problem by combining the diffusion paradigm with current state-of-the-art 3D object detectors in BEV.
arXiv Detail & Related papers (2023-12-18T09:52:14Z)
- GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models [17.488420164181463]
This paper introduces a sophisticated encoder-decoder framework to address visual grounding in autonomous vehicles (AVs).
Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders (Text, Image, Context, and Cross-Modal) with a Multimodal decoder.
arXiv Detail & Related papers (2023-12-06T15:14:30Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation [20.041507826568093]
Grounding Vision to Ceaselessly Created Instructions (GVCCI) is a lifelong learning framework for Language-Guided Robotic Manipulation (LGRM).
GVCCI iteratively generates synthetic instruction via object detection and trains the VG model with the generated data.
Experimental results show that GVCCI leads to a steady improvement in VG by up to 56.7% and improves LGRM by up to 29.4%.
arXiv Detail & Related papers (2023-07-12T07:12:20Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
- RendNet: Unified 2D/3D Recognizer With Latent Space Rendering [18.877203720641393]
We argue that the VG-to-RG (vector graphics to raster graphics) rendering process is essential to effectively combine VG and RG information.
We propose RendNet, a unified architecture for recognition on both 2D and 3D scenarios.
arXiv Detail & Related papers (2022-06-21T01:23:11Z)
- Visual Grounding with Transformers [43.40192909920495]
Our approach is built on top of a transformer encoder-decoder and is independent of any pretrained detectors or word embedding models.
Our method outperforms state-of-the-art proposal-free approaches by a considerable margin on five benchmarks.
arXiv Detail & Related papers (2021-05-10T11:46:12Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)