Context Disentangling and Prototype Inheriting for Robust Visual
Grounding
- URL: http://arxiv.org/abs/2312.11967v1
- Date: Tue, 19 Dec 2023 09:03:53 GMT
- Title: Context Disentangling and Prototype Inheriting for Robust Visual
Grounding
- Authors: Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang and Zechao Li
- Abstract summary: Visual grounding (VG) aims to locate a specific target in an image based on a given language query.
We propose a novel framework with context disentangling and prototype inheriting for robust visual grounding, designed to handle both standard and open-vocabulary scenes.
Our method outperforms the state-of-the-art methods in both scenarios.
- Score: 56.63007386345772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual grounding (VG) aims to locate a specific target in an image based on a
given language query. The discriminative information from context is important
for distinguishing the target from other objects, particularly for the targets
that have the same category as others. However, most previous methods
underestimate such information. Moreover, they are usually designed for the
standard scene (without any novel object), which limits their generalization to
the open-vocabulary scene. In this paper, we propose a novel framework with
context disentangling and prototype inheriting for robust visual grounding to
handle both scenes. Specifically, the context disentangling disentangles the
referent and context features, which achieves better discrimination between
them. The prototype inheriting inherits the prototypes discovered from the
disentangled visual features by a prototype bank to fully utilize the seen
data, especially for the open-vocabulary scene. The fused features, obtained by
applying the Hadamard product to the disentangled linguistic and visual prototype
features, which avoids sharply adjusting the relative importance of the two
feature types, are then attached with a special token and fed to a vision
Transformer encoder for bounding box regression. Extensive experiments are
conducted on both standard and open-vocabulary scenes. The performance
comparisons indicate that our method outperforms the state-of-the-art methods
in both scenarios. The code is available at
https://github.com/WayneTomas/TransCP.
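To make the fusion step described in the abstract concrete, below is a minimal PyTorch-style sketch of Hadamard-product fusion followed by a learnable special token and a Transformer encoder that regresses a bounding box. All module names, dimensions, and the encoder configuration are illustrative assumptions, not the authors' released TransCP implementation (see the linked repository for that).

```python
# Minimal sketch under assumed shapes/names; NOT the official TransCP code.
import torch
import torch.nn as nn

class HadamardFusionGrounder(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        # Learnable special token prepended to the fused sequence.
        self.special_token = nn.Parameter(torch.zeros(1, 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regress normalized (cx, cy, w, h) from the special-token output.
        self.bbox_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, vis_proto_feats, lang_feats):
        # vis_proto_feats, lang_feats: (B, N, dim) disentangled visual-prototype
        # and linguistic features. The element-wise (Hadamard) product fuses
        # them without a learned gate that sharply re-weights one modality.
        fused = vis_proto_feats * lang_feats                    # (B, N, dim)
        tok = self.special_token.expand(fused.size(0), -1, -1)  # (B, 1, dim)
        seq = torch.cat([tok, fused], dim=1)                    # (B, N+1, dim)
        out = self.encoder(seq)
        return self.bbox_head(out[:, 0])                        # (B, 4)

# Usage with dummy tensors:
# model = HadamardFusionGrounder()
# boxes = model(torch.randn(2, 20, 256), torch.randn(2, 20, 256))
```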
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary techniques to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding [8.448399308205266]
We introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects.
We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol.
arXiv Detail & Related papers (2023-11-29T10:40:52Z)
- CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z)
- Audience-Centric Natural Language Generation via Style Infusion [5.6732899077715375]
We propose the novel task of style infusion - infusing the stylistic preferences of audiences in pretrained language generation models.
We leverage limited pairwise human judgments to bootstrap a style analysis model and augment our seed set of judgments.
Our infusion approach can generate compelling stylized examples with generic text prompts.
arXiv Detail & Related papers (2023-01-24T19:57:50Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z)
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.