Referencing Where to Focus: Improving Visual Grounding with Referential Query
- URL: http://arxiv.org/abs/2412.19155v1
- Date: Thu, 26 Dec 2024 10:19:20 GMT
- Title: Referencing Where to Focus: Improving Visual Grounding with Referential Query
- Authors: Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang
- Abstract summary: We propose a novel visual grounding method called RefFormer.
It consists of a query adaption module that can be seamlessly integrated into CLIP.
Our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network.
- Score: 30.33315985826623
- License:
- Abstract: Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing a stronger multi-modal decoder, which typically generates learnable queries by random initialization or from linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty of the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods only use the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach called RefFormer. It consists of a query adaption module that can be seamlessly integrated into CLIP to generate a referential query providing prior context for the decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate its learning difficulty and accurately concentrate on the target object. Additionally, the proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, which outperforms state-of-the-art approaches on five visual grounding benchmarks.
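The abstract describes the query adaption module only at a high level. The following PyTorch sketch illustrates one plausible wiring of the idea: learnable queries first attend to text features and then to multi-level visual features, yielding target-aware ("referential") queries that could seed a grounding decoder. The class name `QueryAdaption`, the dimensions, and the attention layout are assumptions made for illustration; this is not the authors' implementation.

```python
# Minimal sketch of a query adaption idea: learnable queries gather linguistic
# and multi-level visual context before decoding. Names/shapes are illustrative.
import torch
import torch.nn as nn


class QueryAdaption(nn.Module):
    """Refines learnable queries with text and multi-level visual context."""

    def __init__(self, dim=256, num_queries=1, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # initial queries
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_feats, multi_level_visual_feats):
        # text_feats: (B, L, D); multi_level_visual_feats: list of (B, N_i, D)
        batch = text_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Inject linguistic context into the queries.
        q = self.norm1(q + self.text_attn(q, text_feats, text_feats)[0])
        # Aggregate visual context from every feature level, not only the deepest one.
        visual = torch.cat(multi_level_visual_feats, dim=1)
        q = self.norm2(q + self.visual_attn(q, visual, visual)[0])
        return q  # referential queries used as the decoder's starting point


if __name__ == "__main__":
    B, D = 2, 256
    text = torch.randn(B, 20, D)                        # projected text tokens
    levels = [torch.randn(B, n, D) for n in (196, 49)]  # projected patch tokens per level
    ref_queries = QueryAdaption(dim=D)(text, levels)
    print(ref_queries.shape)  # torch.Size([2, 1, 256])
```

In this reading, the backbone producing `text` and `levels` stays frozen and only the lightweight adaption module is trained, which matches the adapter-style behavior the abstract describes.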
Related papers
- DesCLIP: Robust Continual Adaptation via General Attribute Descriptions for Pretrained Vision-Language Models [13.917530818500481]
Continual adaptation of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt for expanding downstream tasks and datasets.
Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge.
We propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects.
arXiv Detail & Related papers (2025-02-02T01:06:02Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Improving Visual Object Tracking through Visual Prompting [24.436237938873695]
Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers.
We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue.
PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts (see the generic CLIP region-scoring sketch after this list).
arXiv Detail & Related papers (2024-09-27T16:39:50Z) - End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting [68.37943632270505]
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories.
Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories.
We propose an open-vocabulary relationship detection method that leverages the rich semantic knowledge of CLIP to discover novel relationships.
arXiv Detail & Related papers (2024-09-19T06:25:01Z) - Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection [101.15777242546649]
Open vocabulary object detection (OVD) aims at seeking an optimal object detector capable of recognizing objects from both base and novel categories.
Recent advances leverage knowledge distillation to transfer insightful knowledge from pre-trained large-scale vision-language models to the task of object detection.
We present a novel OVD framework, termed LBP, that learns background prompts to harness the explored implicit background knowledge.
arXiv Detail & Related papers (2024-06-01T17:32:26Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS)
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image region described by a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z) - Are We There Yet? Learning to Localize in Embodied Instruction Following [1.7300690315775575]
Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for embodied instruction following.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
arXiv Detail & Related papers (2021-01-09T21:49:41Z)
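Several entries above (e.g., PiVOT, the open-vocabulary detection and relationship papers, OVCamo) lean on the semantic knowledge of a frozen pre-trained CLIP model to relate image regions to free-form text. The sketch below shows that generic pattern using the public openai/clip-vit-base-patch32 checkpoint via Hugging Face Transformers; the crop-and-score loop, the boxes, and the prompts are illustrative placeholders and do not reproduce any single paper's pipeline.

```python
# Generic sketch: score candidate image regions against free-form text with a
# frozen CLIP model. Boxes, prompts, and the random image are placeholders.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()
for p in model.parameters():  # keep CLIP frozen, adapter-style
    p.requires_grad_(False)

# In practice the boxes would come from a detector, proposal network, or tracker.
image = Image.fromarray(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8))
boxes = [(0, 0, 320, 240), (320, 0, 640, 240), (0, 240, 640, 480)]
texts = ["a dog lying on the grass", "a red car"]

crops = [image.crop(b) for b in boxes]
inputs = processor(text=texts, images=crops, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text has shape (num_texts, num_crops): text-to-region similarities.
scores = out.logits_per_text.softmax(dim=-1)
best_box = scores.argmax(dim=-1)
for t, idx in zip(texts, best_box.tolist()):
    print(f"{t!r} -> box {boxes[idx]}")
```

Because the backbone is never fine-tuned, the same frozen model can serve many downstream grounding, detection, or tracking tasks; only the lightweight components around it (prompts, adapters, decoders) are learned.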
This list is automatically generated from the titles and abstracts of the papers on this site.