Location-aware Graph Convolutional Networks for Video Question Answering
- URL: http://arxiv.org/abs/2008.09105v1
- Date: Fri, 7 Aug 2020 02:12:56 GMT
- Title: Location-aware Graph Convolutional Networks for Video Question Answering
- Authors: Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, Chuang Gan
- Abstract summary: We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action.
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
- Score: 85.44666165818484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We addressed the challenging task of video question answering, which requires
machines to answer questions about videos in a natural language form. Previous
state-of-the-art methods attempt to apply spatio-temporal attention mechanism
on video frame features without explicitly modeling the location and relations
among object interaction occurred in videos. However, the relations between
object interaction and their location information are very critical for both
action recognition and question reasoning. In this work, we propose to
represent the contents in the video as a location-aware graph by incorporating
the location information of an object into the graph construction. Here, each
node is associated with an object represented by its appearance and location
features. Based on the constructed graph, we propose to use graph convolution
to infer both the category and temporal locations of an action. As the graph is
built on objects, our method is able to focus on the foreground action contents
for better video question answering. Lastly, we leverage an attention mechanism
to combine the output of graph convolution and encoded question features for
final answer reasoning. Extensive experiments demonstrate the effectiveness of
the proposed method. Specifically, our method significantly outperforms
state-of-the-art methods on the TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
Code and pre-trained models are publicly available at:
https://github.com/SunDoge/L-GCN
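As a rough illustration of the pipeline described in the abstract (not the authors' released implementation; see the repository link above for that), the sketch below builds object nodes from appearance plus location features, applies graph convolution over them, and pools the graph output with an encoded question via attention. The module names (LocationAwareGCN, QuestionGuidedAttention), the feature dimensions, and the similarity-based adjacency are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationAwareGCN(nn.Module):
    """Graph convolution over object nodes built from appearance + location features."""

    def __init__(self, app_dim=2048, loc_dim=5, hidden_dim=512, num_layers=2):
        super().__init__()
        # Each node concatenates an object's appearance feature with its location
        # feature (e.g. normalized box coordinates plus relative area) -- an assumption
        # consistent with, but not copied from, the paper's description.
        self.node_proj = nn.Linear(app_dim + loc_dim, hidden_dim)
        self.gcn_layers = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)
        )

    def forward(self, app_feats, loc_feats):
        # app_feats: (B, N, app_dim), loc_feats: (B, N, loc_dim)
        nodes = self.node_proj(torch.cat([app_feats, loc_feats], dim=-1))    # (B, N, H)
        # Adjacency from pairwise node similarity (one common construction;
        # the paper's exact graph construction may differ).
        adj = torch.softmax(nodes @ nodes.transpose(1, 2), dim=-1)           # (B, N, N)
        for layer in self.gcn_layers:
            nodes = F.relu(layer(adj @ nodes)) + nodes                       # message passing + residual
        return nodes


class QuestionGuidedAttention(nn.Module):
    """Attention pooling of graph nodes conditioned on the encoded question."""

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.score = nn.Linear(hidden_dim * 2, 1)

    def forward(self, nodes, question):
        # nodes: (B, N, H); question: (B, H) from some question encoder (e.g. an LSTM)
        q = question.unsqueeze(1).expand(-1, nodes.size(1), -1)
        attn = torch.softmax(self.score(torch.cat([nodes, q], dim=-1)), dim=1)  # (B, N, 1)
        pooled = (attn * nodes).sum(dim=1)                                      # (B, H)
        # The fused feature would feed an answer classifier/ranker for final reasoning.
        return torch.cat([pooled, question], dim=-1)                            # (B, 2H)
```

In use, per-frame object detections (appearance features plus box coordinates) would pass through LocationAwareGCN, and the pooled, question-conditioned output would feed a task-specific answer head (e.g. multiple-choice ranking or open-ended classification, as in TGIF-QA).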
Related papers
- Multi-object event graph representation learning for Video Question Answering [4.236280446793381]
We propose a contrastive language event graph representation learning method called CLanG to address this limitation.
Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R.
arXiv Detail & Related papers (2024-09-12T04:42:51Z) - VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures the model trains without hitting the memory and compute bottleneck.
arXiv Detail & Related papers (2024-04-14T15:49:02Z) - Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering [16.502197578954917]
Graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features.
We propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA.
arXiv Detail & Related papers (2023-07-25T04:41:32Z) - Learning Situation Hyper-Graphs for Video Question Answering [95.18071873415556]
We propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs.
We train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip.
Our results show that learning the underlying situation hyper-graphs helps the system to significantly improve its performance for novel challenges of video question-answering tasks.
arXiv Detail & Related papers (2023-04-18T01:23:11Z) - Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z) - Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering [27.979053252431306]
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities.
We propose an object-oriented reasoning approach in which video is abstracted as a dynamic stream of interacting objects.
This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture.
arXiv Detail & Related papers (2021-06-25T05:12:42Z) - Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering [56.65656211928256]
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for questions about a given video.
We learn question-conditioned visual graphs by exploiting the relation between video and question, enabling each visual node through question-to-visual interactions.
Our method learns question-conditioned visual representations of appearance and motion that show strong capability for video question answering.
arXiv Detail & Related papers (2021-04-29T03:02:37Z) - SumGraph: Video Summarization via Recursive Graph Modeling [59.01856443537622]
We propose graph modeling networks for video summarization, termed SumGraph, to represent a relation graph.
We achieve state-of-the-art performance on several video summarization benchmarks in both supervised and unsupervised settings.
arXiv Detail & Related papers (2020-07-17T08:11:30Z) - Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)