VQA with Cascade of Self- and Co-Attention Blocks
- URL: http://arxiv.org/abs/2302.14777v1
- Date: Tue, 28 Feb 2023 17:20:40 GMT
- Title: VQA with Cascade of Self- and Co-Attention Blocks
- Authors: Aakansha Mishra, Ashish Anand, Prithwijit Guha
- Abstract summary: This work aims to learn an improved multi-modal representation through dense interaction of visual and textual modalities.
The proposed model has an attention block containing both self-attention and co-attention on image and text.
- Score: 3.0013352260516744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of complex attention modules has improved the performance of the
Visual Question Answering (VQA) task. This work aims to learn an improved
multi-modal representation through dense interaction of visual and textual
modalities. The proposed model has an attention block containing both
self-attention and co-attention on image and text. The self-attention modules
provide the contextual information of objects (for an image) and words (for a
question) that are crucial for inferring an answer. On the other hand,
co-attention aids the interaction of image and text. Further, fine-grained
information is obtained from two modalities by using a Cascade of Self- and
Co-Attention blocks (CSCA). This proposal is benchmarked on the widely used
VQA2.0 and TDIUC datasets. The efficacy of the model's key components and of
cascading the attention modules is demonstrated through ablation experiments.
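To make the described architecture concrete, the following is a minimal PyTorch sketch of one self- and co-attention block and its cascading, assuming standard multi-head scaled dot-product attention. The feature dimensions, number of blocks, pooling, and answer head are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of a cascade of self- and co-attention blocks for VQA.
# Layer sizes, block count, and the answer head are illustrative assumptions.
import torch
import torch.nn as nn


class SelfCoAttentionBlock(nn.Module):
    """Self-attention within each modality, then co-attention across modalities."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_co = nn.MultiheadAttention(dim, heads, batch_first=True)  # image attends to text
        self.txt_co = nn.MultiheadAttention(dim, heads, batch_first=True)  # text attends to image
        self.img_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, img, txt):
        # Self-attention: contextualise objects (image) and words (question).
        img = img + self.img_self(img, img, img)[0]
        txt = txt + self.txt_self(txt, txt, txt)[0]
        # Co-attention: dense interaction between the two modalities.
        img = self.img_norm(img + self.img_co(img, txt, txt)[0])
        txt = self.txt_norm(txt + self.txt_co(txt, img, img)[0])
        return img, txt


class CSCA(nn.Module):
    """Cascade of self- and co-attention blocks followed by a simple answer head."""

    def __init__(self, dim=512, heads=8, num_blocks=4, num_answers=3129):
        super().__init__()
        self.blocks = nn.ModuleList(
            SelfCoAttentionBlock(dim, heads) for _ in range(num_blocks)
        )
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, img_feats, txt_feats):
        for block in self.blocks:
            img_feats, txt_feats = block(img_feats, txt_feats)
        # Pool each modality and fuse by concatenation (an illustrative choice).
        fused = torch.cat([img_feats.mean(dim=1), txt_feats.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = CSCA()
    img = torch.randn(2, 36, 512)  # e.g. 36 region features per image
    txt = torch.randn(2, 14, 512)  # e.g. 14 word embeddings per question
    print(model(img, txt).shape)   # torch.Size([2, 3129])
```

In this sketch, each block first contextualizes objects and words with self-attention and then lets each modality attend to the other; stacking several blocks lets the two streams interact repeatedly, which is the cascading idea the abstract describes.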
Related papers
- Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities [18.859309032300402]
We investigate how the integration of information from image and text modalities influences the performance and behavior of Visual Language Model (VLM) predictions.
We study the interplay between text and image modalities in different configurations where visual content is essential for solving the VQA task.
Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence.
arXiv Detail & Related papers (2024-10-02T16:02:02Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Object Attribute Matters in Visual Question Answering [15.705504296316576]
We propose a novel VQA approach from the perspective of utilizing object attributes.
The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing (a minimal sketch of this idea appears after this list).
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness.
arXiv Detail & Related papers (2023-12-20T12:46:30Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- An experimental study of the vision-bottleneck in VQA [17.132865538874352]
We study the vision-bottleneck in Visual Question Answering (VQA).
We experiment with both the quantity and quality of visual objects extracted from images.
We also study the impact of two methods to incorporate the information about objects necessary for answering a question.
arXiv Detail & Related papers (2022-02-14T16:43:32Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)
- An Improved Attention for Visual Question Answering [46.89101543660587]
We consider the problem of Visual Question Answering (VQA).
Given an image and a free-form, open-ended question expressed in natural language, the goal of a VQA system is to provide an accurate answer to this question with respect to the image.
Attention, which captures intra- and inter-modal dependencies, has emerged as perhaps the most widely used mechanism for addressing these challenges.
arXiv Detail & Related papers (2020-11-04T07:34:54Z)
- Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Visual-Linguistic Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
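As referenced in the Object Attribute Matters entry above, fusing attribute and visual features by message passing over a multimodal graph can be sketched as follows. The graph construction, GRU-based node update, layer sizes, and number of message-passing rounds are illustrative assumptions, not that paper's actual architecture.

```python
# Illustrative sketch of attribute-visual fusion via message passing on a
# multimodal graph. All design choices here are assumptions for illustration.
import torch
import torch.nn as nn


class AttributeFusion(nn.Module):
    """Fuses per-object attribute embeddings into visual features by message passing."""

    def __init__(self, dim=512, rounds=2):
        super().__init__()
        self.rounds = rounds
        self.message = nn.Linear(2 * dim, dim)  # builds a message from (receiver, sender)
        self.update = nn.GRUCell(dim, dim)      # updates node states with aggregated messages

    def forward(self, visual, attributes, adjacency):
        # visual:     (batch, num_objects, dim) region features
        # attributes: (batch, num_objects, dim) attribute embeddings (one node per object)
        # adjacency:  (batch, 2*num_objects, 2*num_objects) binary graph over all nodes
        nodes = torch.cat([visual, attributes], dim=1)         # (B, 2N, D)
        B, N, D = nodes.shape
        for _ in range(self.rounds):
            senders = nodes.unsqueeze(1).expand(B, N, N, D)    # node j as sender
            receivers = nodes.unsqueeze(2).expand(B, N, N, D)  # node i as receiver
            msgs = self.message(torch.cat([receivers, senders], dim=-1))
            msgs = (adjacency.unsqueeze(-1) * msgs).sum(dim=2)  # aggregate over neighbours
            nodes = self.update(msgs.reshape(B * N, D), nodes.reshape(B * N, D)).reshape(B, N, D)
        # Return the updated visual nodes, now enriched with attribute information.
        return nodes[:, : visual.size(1)]


if __name__ == "__main__":
    B, N, D = 2, 6, 512
    fusion = AttributeFusion(dim=D)
    adj = torch.ones(B, 2 * N, 2 * N)  # fully connected graph for the demo
    out = fusion(torch.randn(B, N, D), torch.randn(B, N, D), adj)
    print(out.shape)                   # torch.Size([2, 6, 512])
```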