SA-VQA: Structured Alignment of Visual and Semantic Representations for
Visual Question Answering
- URL: http://arxiv.org/abs/2201.10654v1
- Date: Tue, 25 Jan 2022 22:26:09 GMT
- Title: SA-VQA: Structured Alignment of Visual and Semantic Representations for
Visual Question Answering
- Authors: Peixi Xiong, Quanzeng You, Pei Yu, Zicheng Liu, Ying Wu
- Abstract summary: We propose to apply structured alignments, which work with graph representations of visual and textual content.
As demonstrated in our experimental results, such a structured alignment improves reasoning performance.
The proposed model, without any pretraining, outperforms the state-of-the-art methods on the GQA dataset and beats the non-pretrained state-of-the-art methods on the VQA-v2 dataset.
- Score: 29.96818189046649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) attracts much attention from both industry
and academia. As a multi-modality task, it is challenging since it requires not
only visual and textual understanding, but also the ability to align
cross-modality representations. Previous approaches extensively employ
entity-level alignments, such as the correlations between the visual regions
and their semantic labels, or the interactions between question words and object
features. These attempts aim to improve the cross-modality representations,
while ignoring their internal relations. Instead, we propose to apply
structured alignments, which work with graph representations of visual and
textual content, aiming to capture the deep connections between the visual and
textual modalities. Nevertheless, it is nontrivial to represent and integrate
graphs for structured alignments. In this work, we attempt to solve this issue
by first converting the entities of each modality into sequential nodes and an
adjacency graph, and then incorporating them for structured alignments. As
demonstrated in our experimental results, such a structured alignment improves
reasoning performance. In addition, our model also exhibits better
interpretability for each generated answer. The proposed model, without any
pretraining, outperforms the state-of-the-art methods on the GQA dataset and beats
the non-pretrained state-of-the-art methods on the VQA-v2 dataset.
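To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the structured-alignment idea: each modality is reduced to sequential node embeddings plus an adjacency matrix, nodes are first mixed with their graph neighbours, and a cross-modal attention step then aligns the two graphs. All module names, dimensions, the answer-vocabulary size, and the exact propagation and attention forms are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of structured alignment over graph-shaped modalities.
# Not the SA-VQA code; names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class StructuredAlignmentSketch(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 2000):  # sizes assumed
        super().__init__()
        self.visual_proj = nn.Linear(dim, dim)        # project visual node features
        self.text_proj = nn.Linear(dim, dim)          # project textual node features
        self.answer_head = nn.Linear(dim, num_answers)

    def forward(self, vis_nodes, vis_adj, txt_nodes, txt_adj):
        # vis_nodes: (B, Nv, dim) region/object features as sequential nodes
        # vis_adj:   (B, Nv, Nv)  adjacency of the visual scene graph
        # txt_nodes: (B, Nt, dim) word/phrase features as sequential nodes
        # txt_adj:   (B, Nt, Nt)  adjacency of the question parse graph
        v = self.visual_proj(vis_nodes)
        t = self.text_proj(txt_nodes)

        # Intra-modality propagation: mix each node with its graph neighbours.
        v = v + torch.bmm(self._norm(vis_adj), v)
        t = t + torch.bmm(self._norm(txt_adj), t)

        # Cross-modality alignment: attention from text nodes to visual nodes;
        # the soft alignment matrix doubles as an interpretability cue.
        align = torch.softmax(
            torch.bmm(t, v.transpose(1, 2)) / v.size(-1) ** 0.5, dim=-1
        )
        fused = t + torch.bmm(align, v)               # (B, Nt, dim)

        logits = self.answer_head(fused.mean(dim=1))  # pooled answer prediction
        return logits, align

    @staticmethod
    def _norm(adj):
        # Row-normalise adjacency (with self-loops) so propagation is an average.
        adj = adj + torch.eye(adj.size(-1), device=adj.device)
        return adj / adj.sum(dim=-1, keepdim=True)

The soft alignment matrix returned alongside the answer logits is the kind of intermediate signal that could support the per-answer interpretability discussed above.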
Related papers
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general-purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- AlignVE: Visual Entailment Recognition Based on Alignment Relations [32.190603887676666]
Visual entailment (VE) is the task of recognizing whether the semantics of a hypothesis text can be inferred from a given premise image.
A new architecture called AlignVE is proposed to solve the visual entailment problem with a relation interaction method.
The architecture reaches 72.45% accuracy on the SNLI-VE dataset, outperforming previous content-based models under the same settings.
arXiv Detail & Related papers (2022-11-16T07:52:24Z)
- TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval [30.429340065755436]
We devise a new model-agnostic formulation for fine-grained cross-modal alignment.
Inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme.
arXiv Detail & Related papers (2022-09-28T04:11:05Z)
- Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query.
We frame SGM as a graph expansion task by introducing incremental structure expanding (ISE).
We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without requiring additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for the Visual Question Answering (VQA) task.
It first builds a graph for the image and also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Dynamic Language Binding in Relational Visual Reasoning [67.85579756590478]
We present the Language-binding Object Graph Network, the first neural reasoning method with dynamic relational structures across both visual and textual domains.
Our method outperforms other methods in sophisticated question-answering tasks wherein multiple object relations are involved.
arXiv Detail & Related papers (2020-04-30T06:26:20Z)
- Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)