Shifting More Attention to Visual Backbone: Query-modulated Refinement
Networks for End-to-End Visual Grounding
- URL: http://arxiv.org/abs/2203.15442v1
- Date: Tue, 29 Mar 2022 11:17:23 GMT
- Title: Shifting More Attention to Visual Backbone: Query-modulated Refinement
Networks for End-to-End Visual Grounding
- Authors: Jiabo Ye, Junfeng Tian, Ming Yan, Xiaoshan Yang, Xuwu Wang, Ji Zhang,
Liang He, Xin Lin
- Abstract summary: Existing methods use pre-trained query-agnostic visual backbones to extract visual feature maps independently.
We argue that the visual features extracted from the visual backbones and the features needed for multimodal reasoning are inconsistent.
We propose a Query-modulated Refinement Network (QRNet) to address this inconsistency.
- Score: 35.44496191453257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual grounding focuses on establishing fine-grained alignment between
vision and natural language, which has essential applications in multimodal
reasoning systems. Existing methods use pre-trained query-agnostic visual
backbones to extract visual feature maps independently without considering the
query information. We argue that the visual features extracted from the visual
backbones are inconsistent with the features actually needed for multimodal
reasoning. One reason is that there are differences between pre-training
tasks and visual grounding. Moreover, since the backbones are query-agnostic,
it is difficult to completely avoid the inconsistency issue by training the
visual backbone end-to-end in the visual grounding framework. In this paper, we
propose a Query-modulated Refinement Network (QRNet) to address this
inconsistency by adjusting intermediate features in the visual backbone
with a novel Query-aware Dynamic Attention (QD-ATT) mechanism and query-aware
multiscale fusion. The QD-ATT can dynamically compute query-dependent visual
attention at the spatial and channel levels of the feature maps produced by the
visual backbone. We apply the QRNet to an end-to-end visual grounding
framework. Extensive experiments show that the proposed method outperforms
state-of-the-art methods on five widely used datasets.
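To make the QD-ATT idea concrete, the following is a minimal PyTorch-style sketch of query-dependent channel and spatial attention over a backbone feature map. It is not the authors' released code: the module name, layer choices, feature dimensions, and the use of a single pooled query embedding are assumptions for illustration.

```python
# Minimal sketch of query-dependent channel + spatial attention over a
# backbone feature map. Shapes and module names are illustrative only.
import torch
import torch.nn as nn


class QueryDynamicAttention(nn.Module):
    """Re-weights visual features at the channel and spatial levels,
    conditioned on a pooled query (text) embedding."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        # Channel attention: the query predicts one gate per visual channel.
        self.channel_gate = nn.Sequential(
            nn.Linear(txt_dim, vis_dim), nn.Sigmoid()
        )
        # Spatial attention: query and local visual features jointly
        # predict one gate per spatial location.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(vis_dim + txt_dim, 1, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, feat: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # feat:  (B, C, H, W) backbone feature map
        # query: (B, D) pooled query embedding (e.g. [CLS] of a text encoder)
        b, c, h, w = feat.shape
        # Channel-level modulation.
        ch = self.channel_gate(query).view(b, c, 1, 1)
        feat = feat * ch
        # Spatial-level modulation.
        q_map = query.view(b, -1, 1, 1).expand(-1, -1, h, w)
        sp = self.spatial_gate(torch.cat([feat, q_map], dim=1))
        return feat * sp


if __name__ == "__main__":
    attn = QueryDynamicAttention(vis_dim=256, txt_dim=768)
    feats = torch.randn(2, 256, 20, 20)   # intermediate backbone features
    query = torch.randn(2, 768)           # pooled expression embedding
    print(attn(feats, query).shape)       # torch.Size([2, 256, 20, 20])
```

In QRNet, this kind of modulation would be applied to intermediate backbone stages and combined with query-aware multiscale fusion, which the sketch omits.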
Related papers
- Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052]
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task.
We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal.
We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to a performance increase.
arXiv Detail & Related papers (2023-09-07T14:12:31Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason? [30.16956370267339]
We introduce a protocol to evaluate visual representations for the task of Visual Question Answering.
In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module.
We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation based on ground truth.
arXiv Detail & Related papers (2022-12-20T14:36:45Z)
- Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution [26.523051615516742]
We propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels (a minimal sketch of this idea appears at the end of this page).
Our method achieves state-of-the-art performance on three popular visual grounding datasets.
arXiv Detail & Related papers (2022-06-18T04:26:39Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive forecasting decisions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-05-07T02:10:55Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task because the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refines the object-object connections globally.
Experiments show that the proposed method can significantly improve dialogue quality by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.
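Related to the "Bear the Query in Mind" entry above, here is a minimal, hypothetical PyTorch sketch of a query-conditioned convolution in which the text query generates per-sample kernels. The depthwise formulation, kernel size, and all names are assumptions rather than the paper's actual QCM implementation.

```python
# Minimal sketch of a query-conditioned convolution: the text query generates
# per-sample depthwise kernels that are applied to the visual feature map.
# Names, kernel size, and the depthwise formulation are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryConditionedConv(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        # The query predicts one k x k depthwise kernel per visual channel.
        self.kernel_gen = nn.Linear(txt_dim, vis_dim * kernel_size * kernel_size)

    def forward(self, feat: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # feat:  (B, C, H, W) visual feature map, query: (B, D) text embedding
        b, c, h, w = feat.shape
        k = self.kernel_size
        kernels = self.kernel_gen(query).view(b * c, 1, k, k)
        # Grouped convolution applies each sample's own kernels to its own
        # feature map (the batch is folded into the channel dimension).
        out = F.conv2d(
            feat.reshape(1, b * c, h, w), kernels, padding=k // 2, groups=b * c
        )
        return out.reshape(b, c, h, w)


if __name__ == "__main__":
    qcm = QueryConditionedConv(vis_dim=256, txt_dim=768)
    feats = torch.randn(2, 256, 20, 20)
    query = torch.randn(2, 768)
    print(qcm(feats, query).shape)  # torch.Size([2, 256, 20, 20])
```

The grouped convolution folds the batch into the channel dimension so that each sample's feature map is filtered by its own query-generated kernels.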