Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution
- URL: http://arxiv.org/abs/2206.09114v2
- Date: Wed, 22 Jun 2022 02:38:03 GMT
- Title: Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution
- Authors: Chonghan Chen, Qi Jiang, Chih-Hao Wang, Noel Chen, Haohan Wang, Xiang Li, Bhiksha Raj
- Abstract summary: We propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels.
Our method achieves state-of-the-art performance on three popular visual grounding datasets.
- Score: 26.523051615516742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding is a task that aims to locate a target object according to a
natural language expression. As a multi-modal task, feature interaction between
textual and visual inputs is vital. However, previous solutions mainly handle
each modality independently before fusing them together, which does not take
full advantage of relevant textual information while extracting visual
features. To better leverage the textual-visual relationship in visual
grounding, we propose a Query-conditioned Convolution Module (QCM) that
extracts query-aware visual features by incorporating query information into
the generation of convolutional kernels. With our proposed QCM, the downstream
fusion module receives visual features that are more discriminative and focused
on the desired object described in the expression, leading to more accurate
predictions. Extensive experiments on three popular visual grounding datasets
demonstrate that our method achieves state-of-the-art performance. In addition,
the query-aware visual features are informative enough to achieve comparable
performance to the latest methods when directly used for prediction without
further multi-modal fusion.
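The abstract specifies the mechanism (query information conditions the generation of convolutional kernels) but not the exact architecture, so below is a minimal sketch assuming a pooled query embedding and per-sample depthwise kernels; the class name QueryConditionedConv, the kernel_gen layer, and all dimensions are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of query-conditioned convolution (hypothetical, not the paper's code):
# a linear layer maps the query embedding to per-channel depthwise kernels,
# which are then applied to the visual feature map of the same sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryConditionedConv(nn.Module):
    def __init__(self, in_channels: int, query_dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_channels = in_channels
        self.kernel_size = kernel_size
        # Generate one depthwise kernel per visual channel from the query embedding.
        self.kernel_gen = nn.Linear(query_dim, in_channels * kernel_size * kernel_size)

    def forward(self, visual: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W); query: (B, query_dim)
        b, c, h, w = visual.shape
        k = self.kernel_size
        # Per-sample, per-channel kernels conditioned on the query.
        kernels = self.kernel_gen(query).view(b * c, 1, k, k)
        # Fold the batch into the channel dimension so grouped convolution
        # applies each sample's own kernels to its own feature map.
        out = F.conv2d(visual.reshape(1, b * c, h, w), kernels,
                       padding=k // 2, groups=b * c)
        return out.view(b, c, h, w)

# Example: condition ResNet-style features (256 channels) on a 768-d query vector.
if __name__ == "__main__":
    qcm = QueryConditionedConv(in_channels=256, query_dim=768)
    feats = torch.randn(2, 256, 20, 20)
    query = torch.randn(2, 768)
    print(qcm(feats, query).shape)  # torch.Size([2, 256, 20, 20])
```
Folding the batch into the channel dimension and using grouped convolution is a common way to apply per-sample dynamic kernels in a single call; the output features are then query-aware and can be passed to a downstream fusion module or used directly for prediction.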
Related papers
- Object Attribute Matters in Visual Question Answering [15.705504296316576]
We propose a novel VQA approach from the perspective of utilizing object attributes.
The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing.
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness.
arXiv Detail & Related papers (2023-12-20T12:46:30Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive forecasting decisions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-05-07T02:10:55Z)
- Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z)
- Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding [35.44496191453257]
Existing methods use pre-trained query-agnostic visual backbones to extract visual feature maps independently.
We argue that the visual features extracted from the visual backbones and the features needed for multimodal reasoning are inconsistent.
We propose a Query-modulated Refinement Network (QRNet) to address this inconsistency.
arXiv Detail & Related papers (2022-03-29T11:17:23Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally and then refine the object-object connections globally.
Experiments show that the proposed method significantly improves dialogue quality by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
- Dynamic Language Binding in Relational Visual Reasoning [67.85579756590478]
We present Language-binding Object Graph Network, the first neural reasoning method with dynamic relational structures across both visual and textual domains.
Our method outperforms other methods in sophisticated question-answering tasks wherein multiple object relations are involved.
arXiv Detail & Related papers (2020-04-30T06:26:20Z)