From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
- URL: http://arxiv.org/abs/2206.01923v1
- Date: Sat, 4 Jun 2022 07:03:18 GMT
- Title: From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
- Authors: Jingkuan Song, Pengpeng Zeng, Lianli Gao, Heng Tao Shen
- Abstract summary: Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to target different visual areas that are related to the answer.
We propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention to object regions to improve the VQA task.
Experimental results show that our proposed method significantly outperforms state-of-the-art methods.
- Score: 132.95819467484517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to selectively target different visual areas that are related to the answer. Existing visual attention models are generally planar, i.e., different channels of the last conv-layer feature map of an image share the same weight. This conflicts with the attention mechanism, since CNN features are naturally both spatial and channel-wise. Also, visual attention is usually computed at the pixel level, which may cause region discontinuity problems. In this paper, we propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention to object regions to improve the VQA task. Specifically, instead of attending to pixels, we first take advantage of object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we utilize the question to guide the channel attention and spatial attention calculation based on the conv-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We assess the performance of the proposed CVA on three public image QA datasets: COCO-QA, VQA, and Visual7W. Experimental results show that our proposed method significantly outperforms state-of-the-art methods.
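To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of question-guided channel and spatial attention applied to the conv features of object proposals, in the spirit of the abstract. The module name `CubicAttentionSketch`, the layer sizes, and the fusion choices (sigmoid gating over channels, softmax over spatial locations, mean pooling over proposals) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CubicAttentionSketch(nn.Module):
    """Question-guided channel and spatial attention over object-region features.

    v: region conv features, shape (B, N, C, H, W); q: question embedding, shape (B, D).
    """

    def __init__(self, channels: int, q_dim: int):
        super().__init__()
        # Scores one weight per channel from the question plus a pooled region descriptor.
        self.chan_fc = nn.Linear(q_dim + channels, channels)
        # Scores one weight per spatial location from the question plus the local feature.
        self.spat_fc = nn.Linear(q_dim + channels, 1)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        B, N, C, H, W = v.shape

        # Channel attention: gate each channel, conditioned on the question.
        pooled = v.mean(dim=(3, 4))                                   # (B, N, C)
        q_rep = q.unsqueeze(1).expand(B, N, q.size(-1))               # (B, N, D)
        chan_w = torch.sigmoid(self.chan_fc(torch.cat([q_rep, pooled], dim=-1)))
        v = v * chan_w.unsqueeze(-1).unsqueeze(-1)                    # (B, N, C, H, W)

        # Spatial attention: softmax over the H*W locations of each region.
        v_flat = v.view(B, N, C, H * W).permute(0, 1, 3, 2)           # (B, N, HW, C)
        q_loc = q.view(B, 1, 1, -1).expand(B, N, H * W, q.size(-1))   # (B, N, HW, D)
        logits = self.spat_fc(torch.cat([q_loc, v_flat], dim=-1)).squeeze(-1)
        spat_w = F.softmax(logits, dim=-1)                            # (B, N, HW)
        attended = (v_flat * spat_w.unsqueeze(-1)).sum(dim=2)         # (B, N, C)

        # Average over object proposals to obtain one visual vector per image,
        # which would then be fused with the question to predict the answer.
        return attended.mean(dim=1)                                   # (B, C)


# Example: 2 images, 36 region proposals, 2048-channel 7x7 features, 1024-d question.
if __name__ == "__main__":
    att = CubicAttentionSketch(channels=2048, q_dim=1024)
    v = torch.randn(2, 36, 2048, 7, 7)
    q = torch.randn(2, 1024)
    print(att(v, q).shape)  # torch.Size([2, 2048])
```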
Related papers
- Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images [1.6932802756478726]
Visual Question Answering for Remote Sensing (RSVQA) is a task that aims at answering natural language questions about the content of a remote sensing image.
We propose to embed an attention mechanism guided by segmentation into an RSVQA pipeline.
We provide a new VQA dataset that exploits very high-resolution RGB orthophotos annotated with 16 segmentation classes and question/answer pairs.
arXiv Detail & Related papers (2024-07-11T16:59:32Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering [5.547800834335381]
We investigate the efficacy of co-attention transformer layers in helping the network focus on relevant regions while answering the question.
We generate visual attention maps using the question-conditioned image attention scores in these co-attention layers.
Our work sheds light on the function and interpretation of co-attention transformer layers, highlights gaps in current networks, and can guide the development of future VQA models.
arXiv Detail & Related papers (2022-01-11T14:25:17Z)
- Coarse-to-Fine Reasoning for Visual Question Answering [18.535633096397397]
We present a new reasoning framework to fill the gap between visual features and semantic clues in the Visual Question Answering (VQA) task.
Our method first extracts the features and predicates from the image and question.
We then propose a new reasoning framework to jointly learn these features and predicates in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-10-06T06:29:52Z)
- Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has seen increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z)
- Answer Questions with Right Image Regions: A Visual Attention Regularization Approach [46.55924742590242]
We propose a novel visual attention regularization approach, namely AttReg, for better visual grounding in Visual Question Answering (VQA).
AttReg identifies the image regions which are essential for question answering yet unexpectedly ignored by the backbone model.
It can achieve a new state-of-the-art accuracy of 59.92% with an absolute performance gain of 6.93% on the VQA-CP v2 benchmark dataset.
arXiv Detail & Related papers (2021-02-03T07:33:30Z)
- Point and Ask: Incorporating Pointing into Visual Question Answering [14.744503080484977]
We introduce and motivate point-input questions as an extension of Visual Question Answering (VQA).
Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region.
We uncover and address several visual recognition challenges, including the ability to infer human intent.
arXiv Detail & Related papers (2020-11-27T11:43:45Z)
- Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action.
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)