Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in
Visual Question Answering
- URL: http://arxiv.org/abs/2112.07270v1
- Date: Tue, 14 Dec 2021 10:01:26 GMT
- Title: Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in
Visual Question Answering
- Authors: JianJian Cao and Xiameng Qin and Sanyuan Zhao and Jianbing Shen
- Abstract summary: We propose a Graph Matching Attention (GMA) network for the Visual Question Answering (VQA) task.
Firstly, it not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
- Score: 71.6781118080461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Answering semantically complicated questions about an image is
challenging in the Visual Question Answering (VQA) task. Although the image
can be well represented by deep learning, the question is often only simply
embedded and cannot fully express its meaning. Moreover, there is a gap
between the visual and textual features of the two modalities, which makes it
difficult to align and utilize the cross-modality information. In this paper,
we focus on these two problems and propose a Graph Matching Attention (GMA)
network. Firstly, it not only builds a graph for the image but also constructs
a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships with a dual-stage graph
encoder and then present a bilateral cross-modality graph matching attention
to infer the relationships between the image and the question. The updated
cross-modality features are then sent to the answer prediction module for
final answer prediction. Experiments demonstrate that our network achieves
state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. The
ablation studies verify the effectiveness of each module in our GMA network.
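The listing does not include code, but the core fusion step can be pictured with a short sketch. Below is a minimal PyTorch illustration of a bilateral cross-modality graph matching attention step, assuming the image graph and the question graph have already been encoded into node feature matrices (e.g. by the dual-stage graph encoder). The module name, the learned dot-product affinity, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a bilateral cross-modality graph matching attention step.
# Assumes visual-graph and question-graph node features are already encoded;
# shapes, names, and the fusion choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BilateralGraphMatchingAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)  # learned matching metric
        self.fuse_v = nn.Linear(2 * dim, dim)            # fuse visual + matched question
        self.fuse_q = nn.Linear(2 * dim, dim)            # fuse question + matched visual

    def forward(self, v_nodes: torch.Tensor, q_nodes: torch.Tensor):
        """
        v_nodes: (num_visual_nodes, dim)   encoded image-graph nodes
        q_nodes: (num_question_nodes, dim) encoded question-graph nodes
        """
        # Cross-graph affinity between every visual node and every question node.
        scores = self.affinity(v_nodes) @ q_nodes.t()        # (N_v, N_q)

        # Bilateral attention: each visual node attends over question nodes,
        # and each question node attends over visual nodes.
        v_to_q = F.softmax(scores, dim=1) @ q_nodes          # (N_v, dim)
        q_to_v = F.softmax(scores.t(), dim=1) @ v_nodes      # (N_q, dim)

        # Update each modality with the features matched from the other one.
        v_updated = torch.relu(self.fuse_v(torch.cat([v_nodes, v_to_q], dim=-1)))
        q_updated = torch.relu(self.fuse_q(torch.cat([q_nodes, q_to_v], dim=-1)))
        return v_updated, q_updated


# Toy usage: 36 visual region nodes and 12 question word nodes, 512-d features.
gma = BilateralGraphMatchingAttention(dim=512)
v, q = torch.randn(36, 512), torch.randn(12, 512)
v_new, q_new = gma(v, q)   # cross-modality features for the answer predictor
```

In the full model, the updated features from both graphs would then be pooled and passed to the answer prediction module, as described in the abstract.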
Related papers
- InstructG2I: Synthesizing Images from Multimodal Attributed Graphs [50.852150521561676]
We propose a graph context-conditioned diffusion model called InstructG2I.
InstructG2I first exploits the graph structure and multimodal information to conduct informative neighbor sampling.
A Graph-QFormer encoder adaptively encodes the graph nodes into an auxiliary set of graph prompts to guide the denoising process.
arXiv Detail & Related papers (2024-10-09T17:56:15Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [61.93058781222079]
We develop a flexible question-answering framework targeting real-world textual graphs.
We introduce the first retrieval-augmented generation (RAG) approach for general textual graphs.
G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem.
arXiv Detail & Related papers (2024-02-12T13:13:04Z) - SceneGATE: Scene-Graph based co-Attention networks for TExt visual
question answering [2.8974040580489198]
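The Prize-Collecting Steiner Tree (PCST) problem referenced by G-Retriever is NP-hard in general and is solved with dedicated approximations; the snippet below is only a greedy stand-in that conveys the formulation: grow a connected subgraph by trading node "prizes" (retrieval relevance) against edge costs. It is not G-Retriever's actual solver, and the networkx-based interface is an assumption.

```python
# Greedy stand-in for the Prize-Collecting Steiner Tree (PCST) idea: grow a
# connected subgraph that keeps high-"prize" (query-relevant) nodes while
# paying for the edges that connect them. Illustration only, not G-Retriever's solver.
import networkx as nx


def greedy_prize_collecting_subgraph(graph: nx.Graph, prizes: dict,
                                     edge_cost: float = 1.0, budget: int = 20) -> nx.Graph:
    """graph: textual graph; prizes: node -> relevance score from retrieval."""
    # Start from the single most relevant node.
    seed = max(prizes, key=prizes.get)
    selected = {seed}

    while len(selected) < budget:
        best_gain, best_node = 0.0, None
        # Candidate nodes are the frontier: neighbors of the current subgraph.
        frontier = {n for s in selected for n in graph.neighbors(s)} - selected
        for node in frontier:
            gain = prizes.get(node, 0.0) - edge_cost   # prize minus connection cost
            if gain > best_gain:
                best_gain, best_node = gain, node
        if best_node is None:      # no profitable expansion left
            break
        selected.add(best_node)

    return graph.subgraph(selected).copy()


# Toy usage on a small textual graph (a path a-b-c-d-e).
g = nx.path_graph(["a", "b", "c", "d", "e"])
sub = greedy_prize_collecting_subgraph(g, prizes={"a": 2.0, "b": 1.5, "c": 3.0, "e": 0.2})
print(sub.nodes())   # connected subgraph around the high-prize nodes: a, b, c
```

In a RAG pipeline like the one described above, the returned subgraph would then be serialized as context for the generation step.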
- SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering [2.8974040580489198]
The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA.
It reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words.
This is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image.
arXiv Detail & Related papers (2022-12-16T05:10:09Z) - Question-Answer Sentence Graph for Joint Modeling Answer Selection [122.29142965960138]
We train and integrate state-of-the-art (SOTA) models for computing scores between question-question, question-answer, and answer-answer pairs.
Online inference is then performed to solve the AS2 task on unseen queries.
arXiv Detail & Related papers (2022-02-16T05:59:53Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Cross-modal Knowledge Reasoning for Knowledge-based Visual Question
Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z) - Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual
Question Answering [26.21870452615222]
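The memory-based reasoning step described above can be pictured as a single read-update-control cycle. The PyTorch snippet below reads evidence from one knowledge-graph view under the current control state, folds it into a memory, and re-conditions the control on the question; the gating, shapes, and single-view simplification are assumptions, not the published GRUC module.

```python
# Compressed sketch of one memory-based reasoning step in the spirit of a
# Graph-based Read, Update, and Control module. Illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReadUpdateControl(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.read_attn = nn.Linear(2 * dim, 1)   # scores graph nodes against the control state
        self.update = nn.GRUCell(dim, dim)       # folds read evidence into the memory
        self.control = nn.Linear(2 * dim, dim)   # re-attends to the question for the next step

    def forward(self, memory, control, question, graph_nodes):
        """memory/control/question: (dim,); graph_nodes: (num_nodes, dim), one KG view."""
        # Read: attend over graph nodes conditioned on the current control state.
        pairs = torch.cat([graph_nodes, control.expand_as(graph_nodes)], dim=-1)
        attn = F.softmax(self.read_attn(pairs).squeeze(-1), dim=0)
        evidence = attn @ graph_nodes                        # (dim,)

        # Update: fold the read evidence into the memory state.
        new_memory = self.update(evidence.unsqueeze(0), memory.unsqueeze(0)).squeeze(0)

        # Control: condition the next reasoning step on the question again.
        new_control = torch.tanh(self.control(torch.cat([new_memory, question], dim=-1)))
        return new_memory, new_control


# Toy usage: run three reasoning steps over a 20-node knowledge-graph view.
step = ReadUpdateControl(dim=256)
mem, ctrl = torch.zeros(256), torch.zeros(256)
q, nodes = torch.randn(256), torch.randn(20, 256)
for _ in range(3):
    mem, ctrl = step(mem, ctrl, q, nodes)
```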
- Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering [26.21870452615222]
FVQA requires external knowledge beyond the visible content to answer questions about an image.
How to capture question-oriented and information-complementary evidence remains a key challenge in solving this problem.
We propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question.
arXiv Detail & Related papers (2020-06-16T11:03:37Z)