Contrastive Video Question Answering via Video Graph Transformer
- URL: http://arxiv.org/abs/2302.13668v2
- Date: Tue, 11 Jul 2023 12:00:52 GMT
- Title: Contrastive Video Question Answering via Video Graph Transformer
- Authors: Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng
Yan and Tat-Seng Chua
- Abstract summary: We propose a Video Graph Transformer model (CoVGT) to perform video question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves substantially better performance than prior arts on video reasoning tasks.
- Score: 184.3679515511028
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose to perform video question answering (VideoQA) in a Contrastive
manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and
superiority are three-fold: 1) It proposes a dynamic graph transformer module
which encodes video by explicitly capturing the visual objects, their relations
and dynamics, for complex spatio-temporal reasoning. 2) It designs separate
video and text transformers for contrastive learning between the video and text
to perform QA, instead of a multi-modal transformer for answer classification.
Fine-grained video-text communication is done by additional cross-modal
interaction modules. 3) It is optimized by the joint fully- and self-supervised
contrastive objectives between the correct and incorrect answers, as well as
the relevant and irrelevant questions, respectively. With its superior video
encoding and QA solution, we show that CoVGT achieves substantially better
performance than prior arts on video reasoning tasks. Its performance even
surpasses that of models pretrained with millions of external data samples. We
further show that CoVGT can also benefit from cross-modal pretraining, yet with
orders of magnitude smaller data. The results demonstrate the effectiveness and
superiority of CoVGT, and additionally reveal its potential for more
data-efficient pretraining. We hope our success can advance VideoQA beyond
coarse recognition/description towards fine-grained relation reasoning of video
contents. Our code is available at https://github.com/doc-doc/CoVGT.
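To make the third point above concrete, here is a minimal sketch (not the authors' code) of the fully-supervised contrastive objective the abstract describes: a fused video-question embedding is scored against one correct and several incorrect answer embeddings, and the model is trained to rank the correct answer highest. All names, dimensions, and the temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_qa_loss(vq_emb, ans_embs, correct_idx, temperature=0.07):
    """Cross-entropy over cosine similarities between a fused
    video-question embedding and k candidate answer embeddings."""
    vq = vq_emb / np.linalg.norm(vq_emb)
    ans = ans_embs / np.linalg.norm(ans_embs, axis=1, keepdims=True)
    logits = ans @ vq / temperature          # one similarity per candidate
    logits -= logits.max()                   # numerical stability for softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[correct_idx])       # NLL of the correct answer

# toy usage with random embeddings
rng = np.random.default_rng(0)
vq = rng.normal(size=256)                    # fused video+question embedding
answers = rng.normal(size=(5, 256))          # 5 candidate answer embeddings
loss = contrastive_qa_loss(vq, answers, correct_idx=2)
```

The self-supervised counterpart in the abstract would apply the same contrast between relevant and irrelevant questions rather than between answer candidates.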
Related papers
- ViGT: Proposal-free Video Grounding with Learnable Token in Transformer [28.227291816020646]
Video grounding task aims to locate queried action or event in an untrimmed video based on rich linguistic descriptions.
Existing proposal-free methods struggle with the complex interaction between video and query.
We propose a novel boundary regression paradigm that performs regression token learning in a transformer.
arXiv Detail & Related papers (2023-08-11T08:30:08Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Video Graph Transformer for Video Question Answering [182.14696075946742]
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA).
We show that VGT achieves substantially better performance than prior arts on VideoQA tasks that challenge dynamic relation reasoning, in the pretraining-free scenario.
arXiv Detail & Related papers (2022-07-12T06:51:32Z) - Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z) - Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn the complex temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z) - Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.