Bridge to Answer: Structure-aware Graph Interaction Network for Video
Question Answering
- URL: http://arxiv.org/abs/2104.14085v1
- Date: Thu, 29 Apr 2021 03:02:37 GMT
- Title: Bridge to Answer: Structure-aware Graph Interaction Network for Video
Question Answering
- Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn
- Abstract summary: This paper presents a novel method, termed Bridge to Answer, to infer correct answers for questions about a given video.
We learn question-conditioned visual graphs by exploiting the relation between video and question, enabling each visual node to encompass linguistic cues through question-to-visual interactions.
Our method learns question-conditioned visual representations of appearance and motion that show strong capability for video question answering.
- Score: 56.65656211928256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a novel method, termed Bridge to Answer, to infer correct
answers for questions about a given video by leveraging graph interactions across
heterogeneous cross-modal graphs. To realize this, we learn question-conditioned
visual graphs by exploiting the relation between video and question, enabling each
visual node, through question-to-visual interactions, to encompass both visual and
linguistic cues. In addition, we propose bridged visual-to-visual interactions that
incorporate two complementary sources of visual information, appearance and motion,
by placing the question graph as an intermediate bridge. This bridged architecture
allows reliable message passing through the compositional semantics of the question
to generate an appropriate answer. As a result, our method learns question-conditioned
visual representations of appearance and motion that show strong capability for video
question answering. Extensive experiments show that the proposed method achieves
effective and superior performance over state-of-the-art methods on several benchmarks.
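To make the two interaction stages described in the abstract concrete (question-to-visual conditioning, then bridged visual-to-visual message passing through the question graph), the following is a minimal sketch. All module names, feature dimensions, and the simple dot-product attention used here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the bridged cross-modal graph interaction idea described in
# the abstract. Module names, dimensions, and the use of plain dot-product
# attention are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def cross_graph_attention(query_nodes, key_nodes):
    """Aggregate `key_nodes` into each query node via scaled dot-product attention."""
    scores = torch.matmul(query_nodes, key_nodes.transpose(-1, -2))   # (Nq, Nk)
    weights = F.softmax(scores / key_nodes.size(-1) ** 0.5, dim=-1)
    return torch.matmul(weights, key_nodes)                           # (Nq, d)


class BridgedGraphInteraction(nn.Module):
    """Question-conditioned visual graphs with the question graph as a bridge."""

    def __init__(self, dim=512):
        super().__init__()
        self.fuse_app = nn.Linear(2 * dim, dim)   # question context -> appearance nodes
        self.fuse_mot = nn.Linear(2 * dim, dim)   # question context -> motion nodes
        self.bridge = nn.Linear(2 * dim, dim)     # visual evidence -> question nodes

    def forward(self, app_nodes, mot_nodes, q_nodes):
        # 1) Question-to-visual interaction: condition each visual node on the question.
        app_ctx = cross_graph_attention(app_nodes, q_nodes)
        mot_ctx = cross_graph_attention(mot_nodes, q_nodes)
        app_cond = torch.relu(self.fuse_app(torch.cat([app_nodes, app_ctx], dim=-1)))
        mot_cond = torch.relu(self.fuse_mot(torch.cat([mot_nodes, mot_ctx], dim=-1)))

        # 2) Bridged visual-to-visual interaction: appearance and motion cues are
        #    exchanged indirectly, via messages passed through the question graph.
        app_to_q = cross_graph_attention(q_nodes, app_cond)
        mot_to_q = cross_graph_attention(q_nodes, mot_cond)
        bridged_q = torch.relu(self.bridge(torch.cat([app_to_q, mot_to_q], dim=-1)))

        # 3) The bridged question nodes redistribute the mixed visual evidence
        #    back to each visual graph.
        app_out = app_cond + cross_graph_attention(app_cond, bridged_q)
        mot_out = mot_cond + cross_graph_attention(mot_cond, bridged_q)
        return app_out, mot_out, bridged_q


if __name__ == "__main__":
    model = BridgedGraphInteraction(dim=512)
    app = torch.randn(16, 512)   # appearance graph nodes (e.g. frame features)
    mot = torch.randn(8, 512)    # motion graph nodes (e.g. clip features)
    que = torch.randn(12, 512)   # question graph nodes (e.g. word features)
    a, m, q = model(app, mot, que)
    print(a.shape, m.shape, q.shape)
```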
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Visual Commonsense based Heterogeneous Graph Contrastive Learning [79.22206720896664]
We propose a heterogeneous graph contrastive learning method to better perform the visual reasoning task.
Our method is designed in a plug-and-play manner, so that it can be quickly and easily combined with a wide range of representative methods.
arXiv Detail & Related papers (2023-11-11T12:01:18Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
First, it not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships with a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
- Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action.
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
- Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering [26.21870452615222]
FVQA requires external knowledge beyond visible content to answer questions about an image.
How to capture the question-oriented and information-complementary evidence remains a key challenge to solve the problem.
We propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question.
arXiv Detail & Related papers (2020-06-16T11:03:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.