Video Graph Transformer for Video Question Answering
- URL: http://arxiv.org/abs/2207.05342v1
- Date: Tue, 12 Jul 2022 06:51:32 GMT
- Title: Video Graph Transformer for Video Question Answering
- Authors: Junbin Xiao, Pan Zhou, Tat-Seng Chua, Shuicheng Yan
- Abstract summary: This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA).
We show that VGT can achieve much better performances on VideoQA tasks that challenge dynamic relation reasoning than prior arts in the pre-training-free scenario.
- Score: 182.14696075946742
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper proposes a Video Graph Transformer (VGT) model for Video Question
Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic
graph transformer module which encodes video by explicitly capturing the visual
objects, their relations, and dynamics for complex spatio-temporal reasoning;
and 2) it exploits disentangled video and text Transformers for relevance
comparison between the video and text to perform QA, instead of entangled
cross-modal Transformer for answer classification. Vision-text communication is
done by additional cross-modal interaction modules. With more reasonable video
encoding and QA solution, we show that VGT can achieve much better performances
on VideoQA tasks that challenge dynamic relation reasoning than prior arts in
the pretraining-free scenario. Its performances even surpass those models that
are pretrained with millions of external data. We further show that VGT can
also benefit a lot from self-supervised cross-modal pretraining, yet with
orders of magnitude smaller data. These results clearly demonstrate the
effectiveness and superiority of VGT, and reveal its potential for more
data-efficient pretraining. With comprehensive analyses and some heuristic
observations, we hope that VGT can promote VQA research beyond coarse
recognition/description towards fine-grained relation reasoning in realistic
videos. Our code is available at https://github.com/sail-sg/VGT.
Related papers
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers)
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z) - ViGT: Proposal-free Video Grounding with Learnable Token in Transformer [28.227291816020646]
Video grounding task aims to locate queried action or event in an untrimmed video based on rich linguistic descriptions.
Existing proposal-free methods are trapped in the complex interaction between video and query.
We propose a novel boundary regression paradigm that performs regression token learning in a transformer.
arXiv Detail & Related papers (2023-08-11T08:30:08Z) - Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT can achieve much better performances than previous arts on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z) - Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z) - DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering [75.01757991135567]
We propose a Dual-Visual Graph Reasoning Unit (DualVGR) which reasons over videos in an end-to-end fashion.
Our DualVGR network achieves state-of-the-art performance on the benchmark MSVD-QA and SVQA datasets.
arXiv Detail & Related papers (2021-07-10T06:08:15Z)