DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering
- URL: http://arxiv.org/abs/2107.04768v1
- Date: Sat, 10 Jul 2021 06:08:15 GMT
- Title: DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering
- Authors: Jianyu Wang, Bing-Kun Bao, Changsheng Xu
- Abstract summary: We propose a Dual-Visual Graph Reasoning Unit (DualVGR) which reasons over videos in an end-to-end fashion.
Our DualVGR network achieves state-of-the-art performance on the benchmark MSVD-QA and SVQA datasets.
- Score: 75.01757991135567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video question answering is a challenging task, which requires agents to be
able to understand rich video contents and perform spatial-temporal reasoning.
However, existing graph-based methods fail to perform multi-step reasoning
well, neglecting two properties of VideoQA: (1) Even for the same video,
different questions may require different numbers of video clips or objects to
infer the answer with relational reasoning; (2) During reasoning, appearance
and motion features have a complicated interdependence: they are correlated and
complementary to each other. Based on these observations, we propose a
Dual-Visual Graph Reasoning Unit (DualVGR) which reasons over videos in an
end-to-end fashion. The first contribution of our DualVGR is the design of an
explainable Query Punishment Module, which can filter out irrelevant visual
features through multiple cycles of reasoning. The second contribution is the
proposed Video-based Multi-view Graph Attention Network, which captures the
relations between appearance and motion features. Our DualVGR network achieves
state-of-the-art performance on the benchmark MSVD-QA and SVQA datasets, and
demonstrates competitive results on the benchmark MSRVTT-QA dataset. Our code is
available at https://github.com/MMIR/DualVGR-VideoQA.
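As a rough illustration of the two components the abstract describes, the minimal PyTorch sketch below shows (a) a query-guided "punishment" gate that down-weights clip features irrelevant to the question, and (b) a single attention pass over a graph whose nodes are appearance and motion clip features. The class names, feature dimensions, and single-head attention are illustrative assumptions, not the released DualVGR implementation; see the repository linked above for the actual code.

```python
# Hypothetical sketch (not the authors' code): query-guided punishment gating
# plus one attention step over a joint appearance/motion clip graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryPunishment(nn.Module):
    """Scores each clip against the question and soft-masks irrelevant clips."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, clips, question):
        # clips: (B, N, D) per-clip visual features; question: (B, D) pooled query
        q = question.unsqueeze(1).expand_as(clips)                             # (B, N, D)
        relevance = torch.sigmoid(self.score(torch.cat([clips, q], dim=-1)))   # (B, N, 1)
        return clips * relevance, relevance                                    # punished features + weights

class DualGraphAttention(nn.Module):
    """One attention step over a fully connected clip graph, letting appearance
    and motion nodes exchange information (a stand-in for the multi-view GAT)."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, appearance, motion):
        # Treat appearance and motion clips as one set of 2N graph nodes.
        nodes = torch.cat([appearance, motion], dim=1)                          # (B, 2N, D)
        q, k, v = self.q_proj(nodes), self.k_proj(nodes), self.v_proj(nodes)
        attn = F.softmax(q @ k.transpose(1, 2) / nodes.size(-1) ** 0.5, dim=-1)
        nodes = nodes + attn @ v                                                # residual message passing
        n = appearance.size(1)
        return nodes[:, :n], nodes[:, n:]                                       # updated appearance, motion

# Toy usage: 8 clips, 512-d features, one reasoning cycle (DualVGR stacks several).
B, N, D = 2, 8, 512
app, mot, ques = torch.randn(B, N, D), torch.randn(B, N, D), torch.randn(B, D)
punish, gat = QueryPunishment(D), DualGraphAttention(D)
app, _ = punish(app, ques)
mot, _ = punish(mot, ques)
app, mot = gat(app, mot)
print(app.shape, mot.shape)  # torch.Size([2, 8, 512]) torch.Size([2, 8, 512])
```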
Related papers
- Multi-object event graph representation learning for Video Question Answering [4.236280446793381]
We propose a contrastive language event graph representation learning method called CLanG to address this limitation.
Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA benchmarks, NExT-QA and TGIF-QA-R.
arXiv Detail & Related papers (2024-09-12T04:42:51Z)
- Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion [6.096411752534632]
Video captioning models aim to translate the content of videos into accurate natural language.
Existing methods often fail to generate sufficient feature representations of video content.
We propose a video captioning model based on dual graphs and gated fusion.
arXiv Detail & Related papers (2023-08-13T05:18:08Z)
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT can achieve much better performance than previous arts on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- Video Graph Transformer for Video Question Answering [182.14696075946742]
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA).
We show that VGT achieves much better performance than prior arts on VideoQA tasks that challenge dynamic relation reasoning, in the pre-training-free scenario.
arXiv Detail & Related papers (2022-07-12T06:51:32Z)
- Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z)
- LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering [50.11756459499762]
We propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR.
LiVLR first utilizes the graph-based Visual and Linguistic Encoders to obtain multi-grained visual and linguistic representations.
The proposed LiVLR is lightweight and shows its performance advantage on two VideoQA benchmarks.
arXiv Detail & Related papers (2021-11-29T14:18:47Z)
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.