Improving Visual Grounding with Visual-Linguistic Verification and
Iterative Reasoning
- URL: http://arxiv.org/abs/2205.00272v1
- Date: Sat, 30 Apr 2022 13:48:15 GMT
- Title: Improving Visual Grounding with Visual-Linguistic Verification and
Iterative Reasoning
- Authors: Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu
- Abstract summary: We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
- Score: 42.29650807349636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual grounding is a task to locate the target indicated by a natural
language expression. Existing methods extend the generic object detection
framework to this problem. They base the visual grounding on the features from
pre-generated proposals or anchors, and fuse these features with the text
embeddings to locate the target mentioned by the text. However, modeling the
visual features from these predefined locations may fail to fully exploit the
visual context and attribute information in the text query, which limits their
performance. In this paper, we propose a transformer-based framework for
accurate visual grounding by establishing text-conditioned discriminative
features and performing multi-stage cross-modal reasoning. Specifically, we
develop a visual-linguistic verification module to focus the visual features on
regions relevant to the textual descriptions while suppressing the unrelated
areas. A language-guided feature encoder is also devised to aggregate the
visual contexts of the target object to improve the object's distinctiveness.
To retrieve the target from the encoded visual features, we further propose a
multi-stage cross-modal decoder to iteratively speculate on the correlations
between the image and text for accurate target localization. Extensive
experiments on five widely used datasets validate the efficacy of our proposed
components and demonstrate state-of-the-art performance. Our code is publicly available at
https://github.com/yangli18/VLTVG.
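The components described in the abstract can be pictured with a minimal PyTorch-style sketch. Everything below (module names, dimensions, the single-query decoder, and the max-over-text-tokens scoring) is an illustrative assumption rather than the authors' implementation; the actual code is in the linked repository.

```python
# Minimal sketch (assumed names and shapes, not the released implementation):
# text-conditioned verification scores re-weight the visual tokens, then a small
# decoder stack iteratively refines a single target query before box regression.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualLinguisticVerification(nn.Module):
    """Score each visual token against the text and suppress unrelated regions."""
    def __init__(self, d_model=256):
        super().__init__()
        self.vis_proj = nn.Linear(d_model, d_model)
        self.txt_proj = nn.Linear(d_model, d_model)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N, D) flattened image features; txt_tokens: (B, L, D)
        v = F.normalize(self.vis_proj(vis_tokens), dim=-1)
        t = F.normalize(self.txt_proj(txt_tokens), dim=-1)
        # Relevance of a visual token = its best similarity to any text token.
        relevance = (v @ t.transpose(1, 2)).max(dim=-1).values     # (B, N)
        return vis_tokens * relevance.sigmoid().unsqueeze(-1)      # re-weighted features

class IterativeCrossModalDecoder(nn.Module):
    """Multi-stage decoder: one target query repeatedly attends to the verified features."""
    def __init__(self, d_model=256, n_heads=8, n_stages=6):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_stages))
        self.box_head = nn.Linear(d_model, 4)                      # (cx, cy, w, h)

    def forward(self, query, vis_tokens):
        # query: (B, 1, D) learned target query; vis_tokens: (B, N, D) verified features
        for attn in self.stages:
            update, _ = attn(query, vis_tokens, vis_tokens)        # one reasoning stage
            query = query + update                                 # iterative refinement
        return self.box_head(query).sigmoid()                      # normalized box
```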
Related papers
- LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods track the zero-shot results of state-of-the-art image text spotters directly.
Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements.
arXiv Detail & Related papers (2024-05-29T15:35:09Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval [8.00022369501487]
We propose a novel Direction-Oriented Visual-semantic Embedding Model (DOVE) to mine the relationship between vision and language.
Our key idea is to steer the visual and textual representations in the latent space as close as possible to a redundancy-free regional visual representation.
We further exploit a global visual-semantic constraint to reduce the dependency on any single visual representation and to act as an external constraint on the final visual and textual representations.
arXiv Detail & Related papers (2023-10-12T12:28:47Z)
- CiteTracker: Correlating Image and Text for Visual Tracking [114.48653709286629]
We propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text.
Specifically, we develop a text generation module to convert the target image patch into a descriptive text.
We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference.
arXiv Detail & Related papers (2023-08-22T09:53:12Z)
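As a rough illustration of the attention-based correlation step mentioned in the CiteTracker summary above, a hypothetical module (names, shapes, and the residual connection are assumptions, not the paper's code) might look like this:

```python
# Hypothetical sketch: correlate encoded target-description tokens with search-image
# tokens via cross-attention to produce correlated features for target state inference.
import torch.nn as nn

class TextImageCorrelation(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, search_tokens, text_tokens):
        # search_tokens: (B, HW, D) search-image features; text_tokens: (B, L, D) description
        correlated, _ = self.cross_attn(search_tokens, text_tokens, text_tokens)
        return search_tokens + correlated      # correlated features for state estimation
```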
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution [26.523051615516742]
We propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels.
Our method achieves state-of-the-art performance on three popular visual grounding datasets.
arXiv Detail & Related papers (2022-06-18T04:26:39Z)
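A minimal sketch of the query-conditioned convolution idea summarized above; the depthwise kernels, kernel size, and grouped-convolution trick are assumptions for illustration, not the paper's actual module:

```python
# Hypothetical sketch: generate convolution kernels from a pooled query embedding and
# apply them to the visual feature map, so the extracted features are query-aware.
import torch.nn as nn
import torch.nn.functional as F

class QueryConditionedConv(nn.Module):
    def __init__(self, channels=256, ksize=3):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        # Predict one depthwise kernel per channel from the query embedding.
        self.kernel_gen = nn.Linear(channels, channels * ksize * ksize)

    def forward(self, feat, query_emb):
        # feat: (B, C, H, W) visual features; query_emb: (B, C) pooled text query embedding
        B, C, H, W = feat.shape
        kernels = self.kernel_gen(query_emb).view(B * C, 1, self.ksize, self.ksize)
        # Grouped conv applies each sample's own kernels (batch folded into channels).
        out = F.conv2d(feat.reshape(1, B * C, H, W), kernels,
                       padding=self.ksize // 2, groups=B * C)
        return out.view(B, C, H, W)
```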
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language [36.319953919737245]
Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches the given textual descriptions.
We propose an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions.
We demonstrate the success of this approach as well as the performance gains brought by robust feature learning.
arXiv Detail & Related papers (2020-05-15T02:22:28Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing matching from non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
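The probing setup described above can be sketched roughly as follows; the linear projections, dimensions, and dot-product scoring are assumptions, not the paper's probe:

```python
# Hypothetical sketch: a probe that scores frozen text features against candidate
# image-patch features and retrieves the best-matching patch.
import torch.nn as nn

class TextToVisionProbe(nn.Module):
    def __init__(self, text_dim=768, vis_dim=2048, proj_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, proj_dim)   # frozen language-model features in
        self.vis_proj = nn.Linear(vis_dim, proj_dim)     # frozen image-patch features in

    def forward(self, text_feat, patch_feats):
        # text_feat: (B, text_dim); patch_feats: (B, K, vis_dim) candidate patches
        t = self.text_proj(text_feat).unsqueeze(1)       # (B, 1, proj_dim)
        v = self.vis_proj(patch_feats)                   # (B, K, proj_dim)
        scores = (t * v).sum(dim=-1)                     # matching score per patch
        return scores.argmax(dim=-1)                     # index of retrieved patch
```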