On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
- URL: http://arxiv.org/abs/2201.03965v1
- Date: Tue, 11 Jan 2022 14:25:17 GMT
- Title: On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
- Authors: Ankur Sikarwar and Gabriel Kreiman
- Abstract summary: We investigate the efficacy of co-attention transformer layers in helping the network focus on relevant regions while answering the question.
We generate visual attention maps using the question-conditioned image attention scores in these co-attention layers.
Our work sheds light on the function and interpretation of co-attention transformer layers, highlights gaps in current networks, and can guide the development of future VQA models.
- Score: 5.547800834335381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, multi-modal transformers have shown significant progress in
Vision-Language tasks, such as Visual Question Answering (VQA), outperforming
previous architectures by a considerable margin. This improvement in VQA is
often attributed to the rich interactions between vision and language streams.
In this work, we investigate the efficacy of co-attention transformer layers in
helping the network focus on relevant regions while answering the question. We
generate visual attention maps using the question-conditioned image attention
scores in these co-attention layers. We evaluate the effect of the following
critical components on visual attention of a state-of-the-art VQA model: (i)
number of object region proposals, (ii) question part of speech (POS) tags,
(iii) question semantics, (iv) number of co-attention layers, and (v) answer
accuracy. We compare the neural network attention maps against human attention
maps both qualitatively and quantitatively. Our findings indicate that
co-attention transformer modules are crucial in attending to relevant regions
of the image given a question. Importantly, we observe that the semantic
meaning of the question is not what drives visual attention, but specific
keywords in the question do. Our work sheds light on the function and
interpretation of co-attention transformer layers, highlights gaps in current
networks, and can guide the development of future VQA models and networks that
simultaneously process visual and language streams.
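The abstract does not spell out how the attention maps are extracted or scored. As a rough, hypothetical sketch (all function and variable names below are assumptions, and the human map is assumed to be projected onto the same object region proposals as the model's attention), question-conditioned image attention from a co-attention layer could be pooled into a per-region map and compared against a human attention map with a rank correlation:

```python
# Minimal sketch (not the authors' code): pool question-conditioned image
# attention from a co-attention layer and compare it to a human attention map.
# Assumes the model exposes co-attention weights of shape
# (num_heads, num_question_tokens, num_image_regions) for a given layer.
import numpy as np
from scipy.stats import spearmanr

def region_attention_map(co_attn: np.ndarray) -> np.ndarray:
    """Average over heads and question tokens to get one score per image region."""
    per_region = co_attn.mean(axis=(0, 1))          # (num_regions,)
    return per_region / (per_region.sum() + 1e-8)   # normalize to a distribution

def compare_to_human(model_map: np.ndarray, human_map: np.ndarray) -> float:
    """Spearman rank correlation between model and human attention over the same regions."""
    rho, _ = spearmanr(model_map, human_map)
    return rho

# Example with random placeholders (8 heads, 14 question tokens, 36 region proposals)
rng = np.random.default_rng(0)
co_attn = rng.random((8, 14, 36))
human = rng.random(36)
human = human / human.sum()

model_map = region_attention_map(co_attn)
print("Spearman rho:", compare_to_human(model_map, human))
```

Averaging over heads and question tokens is only one plausible pooling choice; the per-head or per-keyword maps could be compared in the same way.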
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision (a rough sketch of this idea appears after this related-papers list).
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images [1.6932802756478726]
Visual Question Answering for Remote Sensing (RSVQA) is a task that aims at answering natural language questions about the content of a remote sensing image.
We propose to embed an attention mechanism guided by segmentation into a RSVQA pipeline.
We provide a new VQA dataset that exploits very high-resolution RGB orthophotos annotated with 16 segmentation classes and question/answer pairs.
arXiv Detail & Related papers (2024-07-11T16:59:32Z)
- Convolution-enhanced Evolving Attention Networks [41.684265133316096]
The Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer significantly outperforms state-of-the-art models.
This is the first work that explicitly models the layer-wise evolution of attention maps.
arXiv Detail & Related papers (2022-12-16T08:14:04Z)
- Weakly Supervised Grounding for VQA in Vision-Language Transformers [112.5344267669495]
This paper focuses on the problem of weakly supervised grounding in context of visual question answering in transformers.
The approach leverages capsules formed by grouping visual tokens in the visual encoder.
We evaluate our approach on the challenging GQA as well as VQA-HAT dataset for VQA grounding.
arXiv Detail & Related papers (2022-07-05T22:06:03Z)
- From Pixels to Objects: Cubic Visual Attention for Visual Question Answering [132.95819467484517]
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to target the visual regions related to the answer.
We propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention to object regions to improve VQA.
Experimental results show that our proposed method significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-06-04T07:03:18Z)
- An experimental study of the vision-bottleneck in VQA [17.132865538874352]
We study the vision-bottleneck in Visual Question Answering (VQA).
We experiment with both the quantity and quality of visual objects extracted from images.
We also study the impact of two methods to incorporate the information about objects necessary for answering a question.
arXiv Detail & Related papers (2022-02-14T16:43:32Z)
- VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering [15.017443876780286]
We present VQA-MHUG, a novel dataset of multimodal human gaze on both images and questions during visual question answering (VQA).
We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models.
arXiv Detail & Related papers (2021-09-27T15:06:10Z)
- Transformer Interpretability Beyond Attention Visualization [87.96102461221415]
Self-attention techniques, and specifically Transformers, are dominating the field of text processing.
In this work, we propose a novel way to compute relevancy for Transformer networks.
arXiv Detail & Related papers (2020-12-17T18:56:33Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Ventral-Dorsal Neural Networks: Object Detection via Selective Attention [51.79577908317031]
We propose a new framework called Ventral-Dorsal Networks (VDNets).
Inspired by the structure of the human visual system, we propose the integration of a "Ventral Network" and a "Dorsal Network".
Our experimental results reveal that the proposed method outperforms state-of-the-art object detection approaches.
arXiv Detail & Related papers (2020-05-15T23:57:36Z)
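Returning to the DAPE V2 entry above: its summary describes processing pre-softmax attention scores as a feature map with convolution. The module below is only an illustrative guess at that mechanism (a hypothetical `ConvRefinedAttention`, not the paper's implementation), applying a per-head 2D convolution to the score tensor before the softmax:

```python
# Illustrative sketch of "attention scores as a feature map" (not DAPE V2's code):
# pre-softmax attention logits of shape (batch, heads, T, T) are refined with a
# depthwise 2D convolution before the usual softmax-weighted value aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRefinedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, kernel_size: int = 3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One 2D filter per head, applied to the (T, T) score "image".
        self.score_conv = nn.Conv2d(num_heads, num_heads, kernel_size,
                                    padding=kernel_size // 2, groups=num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))                              # (b, h, t, head_dim)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5     # (b, h, t, t)
        scores = scores + self.score_conv(scores)                   # treat scores as a feature map
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)

# Example usage
layer = ConvRefinedAttention(dim=64, num_heads=8)
tokens = torch.randn(2, 10, 64)
print(layer(tokens).shape)  # torch.Size([2, 10, 64])
```

The residual connection around the convolution and the depthwise grouping are design assumptions made here to keep the sketch small; the paper's actual architecture may differ.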
This list is automatically generated from the titles and abstracts of the papers on this site.