Contrastive Video Question Answering via Video Graph Transformer
- URL: http://arxiv.org/abs/2302.13668v2
- Date: Tue, 11 Jul 2023 12:00:52 GMT
- Title: Contrastive Video Question Answering via Video Graph Transformer
- Authors: Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng
Yan and Tat-Seng Chua
- Abstract summary: We propose a Video Graph Transformer model (CoVGT) to perform video question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves substantially better performance than prior arts on video reasoning tasks.
- Score: 184.3679515511028
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose to perform video question answering (VideoQA) in a Contrastive
manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and
superiority are three-fold: 1) It proposes a dynamic graph transformer module
which encodes video by explicitly capturing the visual objects, their relations
and dynamics, for complex spatio-temporal reasoning. 2) It designs separate
video and text transformers for contrastive learning between the video and text
to perform QA, instead of a multi-modal transformer for answer classification.
Fine-grained video-text communication is done by additional cross-modal
interaction modules. 3) It is optimized by the joint fully- and self-supervised
contrastive objectives between the correct and incorrect answers, as well as
the relevant and irrelevant questions, respectively. With its superior video
encoding and QA solution, we show that CoVGT achieves substantially better
performance than prior arts on video reasoning tasks. Its performance even
surpasses that of models pretrained with millions of external data samples. We
further show that CoVGT can also benefit from cross-modal pretraining, yet with
orders of magnitude smaller data. The results demonstrate the effectiveness and
superiority of CoVGT, and additionally reveal its potential for more
data-efficient pretraining. We hope our success can advance VideoQA beyond
coarse recognition/description towards fine-grained relation reasoning of video
contents. Our code is available at https://github.com/doc-doc/CoVGT.
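To make the third point above concrete, here is a minimal sketch (not the authors' code) of the fully-supervised contrastive objective the abstract describes: a fused video-question embedding is scored against one correct and several incorrect answer embeddings, and the model is trained to rank the correct answer highest. All names, dimensions, and the temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_qa_loss(vq_emb, ans_embs, correct_idx, temperature=0.07):
    """Cross-entropy over cosine similarities between a fused
    video-question embedding and k candidate answer embeddings."""
    vq = vq_emb / np.linalg.norm(vq_emb)
    ans = ans_embs / np.linalg.norm(ans_embs, axis=1, keepdims=True)
    logits = ans @ vq / temperature          # one similarity per candidate
    logits -= logits.max()                   # numerical stability for softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[correct_idx])       # NLL of the correct answer

# toy usage with random embeddings
rng = np.random.default_rng(0)
vq = rng.normal(size=256)                    # fused video+question embedding
answers = rng.normal(size=(5, 256))          # 5 candidate answer embeddings
loss = contrastive_qa_loss(vq, answers, correct_idx=2)
```

The self-supervised counterpart in the abstract would apply the same contrast between relevant and irrelevant questions rather than between answer candidates.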
Related papers
- ViGT: Proposal-free Video Grounding with Learnable Token in Transformer [28.227291816020646]
Video grounding task aims to locate queried action or event in an untrimmed video based on rich linguistic descriptions.
Existing proposal-free methods struggle with the complex interaction between video and query.
We propose a novel boundary regression paradigm that performs regression token learning in a transformer.
arXiv Detail & Related papers (2023-08-11T08:30:08Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Video Graph Transformer for Video Question Answering [182.14696075946742]
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA).
We show that VGT achieves substantially better performance than prior arts on VideoQA tasks that challenge dynamic relation reasoning, in the pretraining-free scenario.
arXiv Detail & Related papers (2022-07-12T06:51:32Z) - Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z) - Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn the complex temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z) - Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.