An Improved Attention for Visual Question Answering
- URL: http://arxiv.org/abs/2011.02164v3
- Date: Thu, 3 Jun 2021 19:59:08 GMT
- Title: An Improved Attention for Visual Question Answering
- Authors: Tanzila Rahman, Shih-Han Chou, Leonid Sigal and Giuseppe Carenini
- Abstract summary: We consider the problem of Visual Question Answering (VQA).
Given an image and a free-form, open-ended question expressed in natural language, the goal of a VQA system is to provide an accurate answer to this question with respect to the image.
Attention, which captures intra- and inter-modal dependencies, has emerged as perhaps the most widely used mechanism for addressing these challenges.
- Score: 46.89101543660587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of Visual Question Answering (VQA). Given an image
and a free-form, open-ended question expressed in natural language, the goal
of a VQA system is to provide an accurate answer to this question with respect
to the image. The task is challenging because it requires simultaneous and
intricate understanding of both visual and textual information. Attention,
which captures intra- and inter-modal dependencies, has emerged as perhaps the
most widely used mechanism for addressing these challenges. In this paper, we
propose an improved attention-based architecture to solve VQA. We incorporate
an Attention on Attention (AoA) module within an encoder-decoder framework,
which is able to determine the relation between attention results and queries.
The attention module generates a weighted average for each query. The AoA
module, in turn, first generates an information vector and an attention gate
from the attention results and the current context, and then applies another
attention to produce the final attended information by multiplying the two.
We also propose a multimodal fusion module to combine visual and textual
information. The goal of this fusion module is to dynamically decide how much
information should be considered from each modality. Extensive experiments on
the VQA-v2 benchmark dataset show that our method achieves state-of-the-art
performance.
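To make the abstract's description more concrete, the sketch below shows one way the Attention on Attention step and the modality-fusion gate could be implemented in PyTorch. It is a minimal illustration based only on the description above; the module names, dimensions, and the exact gating formulation are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class AttentionOnAttention(nn.Module):
    """Minimal Attention-on-Attention (AoA) block, following the abstract:
    standard attention produces a weighted average for each query; AoA then
    builds an information vector and an attention gate from the attention
    result and the query, and multiplies the two for the final output."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Both projections see the attention result concatenated with the query.
        self.info_proj = nn.Linear(2 * dim, dim)   # information vector
        self.gate_proj = nn.Linear(2 * dim, dim)   # attention gate

    def forward(self, query, key, value):
        attended, _ = self.attn(query, key, value)      # weighted average per query
        joint = torch.cat([attended, query], dim=-1)    # condition on current context
        info = self.info_proj(joint)                    # information vector
        gate = torch.sigmoid(self.gate_proj(joint))     # attention gate in (0, 1)
        return gate * info                              # final attended information


class ModalityFusionGate(nn.Module):
    """Hypothetical fusion gate that dynamically weighs the visual and textual
    streams; the paper's fusion module is only described at a high level, so
    this exact formulation is an assumption."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, visual, textual):
        alpha = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))
        return alpha * visual + (1.0 - alpha) * textual


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)   # e.g. 36 image-region features
    words = torch.randn(2, 14, 512)     # e.g. 14 question-token embeddings
    aoa = AttentionOnAttention(dim=512)
    fused_regions = aoa(regions, words, words)   # question-guided visual features
    pooled_v = fused_regions.mean(dim=1)
    pooled_q = words.mean(dim=1)
    answer_feat = ModalityFusionGate(512)(pooled_v, pooled_q)
    print(answer_feat.shape)            # torch.Size([2, 512])
```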
Related papers
- Object Attribute Matters in Visual Question Answering [15.705504296316576]
We propose a novel VQA approach from the perspective of utilizing object attributes.
The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing.
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness.
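As a rough illustration of the attribute-fusion idea above (fusing attribute and visual features through message passing on a multimodal graph), the following sketch uses plain PyTorch. The graph construction, naming, and single round of message passing are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class AttributeMessagePassing(nn.Module):
    """One round of message passing that fuses per-object attribute embeddings
    into the object's visual feature (a simplified stand-in for a multimodal
    graph neural network)."""

    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(dim, dim)   # transform attribute -> message
        self.update = nn.GRUCell(dim, dim)   # update object state with aggregated message

    def forward(self, obj_feats, attr_feats, attr_mask):
        # obj_feats:  (num_objects, dim) visual features
        # attr_feats: (num_objects, max_attrs, dim) attribute embeddings per object
        # attr_mask:  (num_objects, max_attrs) 1 for real attributes, 0 for padding
        msgs = torch.relu(self.message(attr_feats))
        msgs = msgs * attr_mask.unsqueeze(-1)                         # drop padded slots
        agg = msgs.sum(dim=1) / attr_mask.sum(dim=1, keepdim=True).clamp(min=1)
        return self.update(agg, obj_feats)                            # fused object features


if __name__ == "__main__":
    objs = torch.randn(36, 256)
    attrs = torch.randn(36, 4, 256)
    mask = torch.ones(36, 4)
    fused = AttributeMessagePassing(256)(objs, attrs, mask)
    print(fused.shape)  # torch.Size([36, 256])
```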
arXiv Detail & Related papers (2023-12-20T12:46:30Z) - LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
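A minimal sketch of what a mutual relation attention step could look like: instance features and background features attend to each other with standard cross-attention. This is an assumed simplification for illustration only, not the actual LOIS module.

```python
import torch
import torch.nn as nn


class MutualRelationAttention(nn.Module):
    """Toy mutual (bidirectional) cross-attention between instance-level
    features and background features, loosely following the idea above."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.inst_to_bg = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bg_to_inst = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, instances, background):
        # Instances query the background context, and vice versa.
        inst_ctx, _ = self.inst_to_bg(instances, background, background)
        bg_ctx, _ = self.bg_to_inst(background, instances, instances)
        return instances + inst_ctx, background + bg_ctx


if __name__ == "__main__":
    inst = torch.randn(2, 20, 256)   # instance (object) features
    bg = torch.randn(2, 49, 256)     # background grid features
    out_inst, out_bg = MutualRelationAttention(256)(inst, bg)
    print(out_inst.shape, out_bg.shape)
```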
arXiv Detail & Related papers (2023-07-26T12:13:00Z) - AVIS: Autonomous Visual Information Seeking with Large Language Model
Agent [123.75169211547149]
We propose an autonomous information seeking visual question answering framework, AVIS.
Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools.
AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
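The sketch below illustrates the general shape of an LLM-driven tool-selection loop of the kind AVIS describes. The tool names, the `call_llm` stub, and the stopping rule are placeholders for illustration, not the actual AVIS agent.

```python
from typing import Callable, Dict


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned decision here."""
    return "ANSWER: (stub)"


def image_search(state: str) -> str:
    return state + " | image-search results"


def knowledge_lookup(state: str) -> str:
    return state + " | retrieved facts"


# Hypothetical registry of external tools the planner may invoke.
TOOLS: Dict[str, Callable[[str], str]] = {
    "IMAGE_SEARCH": image_search,
    "KNOWLEDGE_LOOKUP": knowledge_lookup,
}


def answer_question(question: str, max_steps: int = 5) -> str:
    """The LLM repeatedly chooses the next tool (or answers) given the
    question and everything gathered so far."""
    state = f"Question: {question}"
    for _ in range(max_steps):
        decision = call_llm(f"{state}\nNext action (tool name or ANSWER: ...)?")
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        tool = TOOLS.get(decision.strip())
        if tool is None:          # unknown action: stop and answer directly
            break
        state = tool(state)       # append the tool's output to the working state
    return call_llm(f"{state}\nGive the final answer.")


if __name__ == "__main__":
    print(answer_question("Which museum holds this painting?"))
```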
arXiv Detail & Related papers (2023-06-13T20:50:22Z) - VQA with Cascade of Self- and Co-Attention Blocks [3.0013352260516744]
This work aims to learn an improved multi-modal representation through dense interaction of visual and textual modalities.
The proposed model has an attention block containing both self-attention and co-attention on image and text.
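A minimal, assumed sketch of one block combining self-attention within each modality and co-attention across modalities, as the summary describes; the layer names and ordering are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn


class SelfCoAttentionBlock(nn.Module):
    """One block that applies self-attention within the image and text streams,
    then co-attention (each stream attends to the other)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_co = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_co = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img, txt):
        img = img + self.img_self(img, img, img)[0]   # intra-modal (image)
        txt = txt + self.txt_self(txt, txt, txt)[0]   # intra-modal (text)
        img = img + self.img_co(img, txt, txt)[0]     # image attends to text
        txt = txt + self.txt_co(txt, img, img)[0]     # text attends to image
        return img, txt


if __name__ == "__main__":
    img, txt = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
    blocks = nn.ModuleList([SelfCoAttentionBlock(512) for _ in range(3)])  # a small cascade
    for block in blocks:
        img, txt = block(img, txt)
    print(img.shape, txt.shape)
```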
arXiv Detail & Related papers (2023-02-28T17:20:40Z) - VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks
for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
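The joint-graph idea (scene graph and concept graph inter-connected through a QA-context super node) can be sketched roughly as below; the adjacency construction and the simple mean-aggregation layer are illustrative assumptions, not the VQA-GNN architecture.

```python
import torch
import torch.nn as nn


def build_joint_graph(num_scene: int, num_concept: int):
    """Adjacency over [scene nodes | concept nodes | super node]: keep each
    sub-graph internally connected (placeholder) and connect every node to the
    super node that represents the QA context."""
    n = num_scene + num_concept + 1
    adj = torch.zeros(n, n)
    adj[:num_scene, :num_scene] = 1              # scene-graph edges (placeholder)
    adj[num_scene:-1, num_scene:-1] = 1          # concept-graph edges (placeholder)
    adj[-1, :] = 1                               # super node <-> everything
    adj[:, -1] = 1
    return adj


class MeanGNNLayer(nn.Module):
    """One very simple message-passing layer: average neighbour features,
    then apply a shared linear update."""

    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, feats, adj):
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        neigh = adj @ feats / deg
        return torch.relu(self.update(torch.cat([feats, neigh], dim=-1)))


if __name__ == "__main__":
    scene, concept = torch.randn(10, 128), torch.randn(6, 128)
    super_node = torch.randn(1, 128)             # encodes the QA context
    feats = torch.cat([scene, concept, super_node], dim=0)
    adj = build_joint_graph(10, 6)
    out = MeanGNNLayer(128)(feats, adj)
    print(out[-1].shape)                         # updated unified representation
```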
arXiv Detail & Related papers (2022-05-23T17:55:34Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in
Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for the Visual Question Answering (VQA) task.
First, it not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
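A bare-bones sketch of bilateral cross-modality graph matching: after encoding each graph, every image node attends over question nodes and vice versa via a learned affinity matrix. The formulation below is an assumption for illustration, not the paper's GMA module.

```python
import torch
import torch.nn as nn


class GraphMatchingAttention(nn.Module):
    """Bilateral matching between image-graph node features and question-graph
    node features through a learned affinity matrix (a simplified stand-in)."""

    def __init__(self, dim: int):
        super().__init__()
        self.affinity = nn.Bilinear(dim, dim, 1)

    def forward(self, img_nodes, qst_nodes):
        # Pairwise affinity between every image node and every question node.
        ni, nq, d = img_nodes.size(0), qst_nodes.size(0), img_nodes.size(1)
        scores = self.affinity(
            img_nodes.unsqueeze(1).expand(ni, nq, d).reshape(-1, d),
            qst_nodes.unsqueeze(0).expand(ni, nq, d).reshape(-1, d),
        ).view(ni, nq)
        img_ctx = scores.softmax(dim=1) @ qst_nodes      # image nodes gather question info
        qst_ctx = scores.softmax(dim=0).t() @ img_nodes  # question nodes gather image info
        return img_nodes + img_ctx, qst_nodes + qst_ctx


if __name__ == "__main__":
    img_g, qst_g = torch.randn(36, 256), torch.randn(14, 256)
    out_i, out_q = GraphMatchingAttention(256)(img_g, qst_g)
    print(out_i.shape, out_q.shape)
```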
arXiv Detail & Related papers (2021-12-14T10:01:26Z) - Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
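A toy illustration of a memory-based reasoning step in the read/update/control spirit described above; the single-graph read, GRU update, and gating are simplified assumptions, not the paper's GRUC module.

```python
import torch
import torch.nn as nn


class ReasoningStep(nn.Module):
    """One memory-based reasoning step: read from a graph's node features with
    question-conditioned attention, update the memory with a GRU cell, and
    emit a control gate saying how much this step should contribute."""

    def __init__(self, dim: int):
        super().__init__()
        self.read_attn = nn.Linear(2 * dim, 1)
        self.update = nn.GRUCell(dim, dim)
        self.control = nn.Linear(2 * dim, 1)

    def forward(self, memory, question, nodes):
        # Read: attend over graph nodes conditioned on the question.
        q = question.unsqueeze(0).expand_as(nodes)
        weights = torch.softmax(self.read_attn(torch.cat([nodes, q], dim=-1)), dim=0)
        read = (weights * nodes).sum(dim=0)
        # Update: fold the read vector into the running memory.
        new_memory = self.update(read.unsqueeze(0), memory.unsqueeze(0)).squeeze(0)
        # Control: gate how strongly this step overwrites the memory.
        gate = torch.sigmoid(self.control(torch.cat([new_memory, question], dim=-1)))
        return gate * new_memory + (1 - gate) * memory


if __name__ == "__main__":
    dim = 128
    memory, question = torch.zeros(dim), torch.randn(dim)
    graph_nodes = torch.randn(30, dim)    # e.g. nodes of a visual/semantic/fact graph
    step = ReasoningStep(dim)
    for _ in range(4):                    # a short chain of reasoning steps
        memory = step(memory, question, graph_nodes)
    print(memory.shape)
```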
arXiv Detail & Related papers (2020-08-31T23:25:01Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
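A minimal, assumed sketch of the frame-selection gating idea: score each frame against the question and gate frame features accordingly before answering. It is illustrative only, not the paper's full dual-level attention model.

```python
import torch
import torch.nn as nn


class FrameSelectionGate(nn.Module):
    """Score each video frame against the question representation and gate
    (down-weight) frames that look irrelevant to the question."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, frame_feats, question_feat):
        # frame_feats: (num_frames, dim); question_feat: (dim,)
        q = question_feat.unsqueeze(0).expand_as(frame_feats)
        gates = torch.sigmoid(self.score(torch.cat([frame_feats, q], dim=-1)))
        return gates * frame_feats, gates.squeeze(-1)   # gated features + per-frame scores


if __name__ == "__main__":
    frames = torch.randn(32, 300)     # per-frame features (e.g. video + dense captions fused)
    question = torch.randn(300)
    gated, scores = FrameSelectionGate(300)(frames, question)
    print(gated.shape, scores.shape)  # localize by inspecting the highest-scoring frames
```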
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.