Self-supervised pre-training and contrastive representation learning for
multiple-choice video QA
- URL: http://arxiv.org/abs/2009.08043v2
- Date: Mon, 14 Dec 2020 11:32:24 GMT
- Title: Self-supervised pre-training and contrastive representation learning for
multiple-choice video QA
- Authors: Seonhoon Kim, Seohyeong Jeong, Eunbyul Kim, Inho Kang, Nojun Kwak
- Abstract summary: Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions.
We propose novel training schemes for multiple-choice video question answering: a self-supervised pre-training stage and supervised contrastive learning used as an auxiliary task in the main stage.
We evaluate our proposed model on highly competitive benchmark datasets related to multiple-choice video QA: TVQA, TVQA+, and DramaQA.
- Score: 39.78914328623504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Question Answering (Video QA) requires fine-grained understanding of both the video and language modalities to answer the given questions. In this paper, we propose novel training schemes for multiple-choice video question answering: a self-supervised pre-training stage and supervised contrastive learning used as an auxiliary task in the main stage. In the self-supervised pre-training stage, we transform the original task of predicting the correct answer into one of predicting the relevant question, which exposes the model to broader contextual inputs without requiring any additional data or annotation. For contrastive learning in the main stage, we add masking noise to the input corresponding to the ground-truth answer, treat the original (unmasked) ground-truth input as the positive sample, and treat the remaining candidates as negative samples. We show that mapping the positive sample closer to the masked input improves model performance. We further employ locally aligned attention to focus more effectively on the video frames that are most relevant to the corresponding subtitle sentences. We evaluate the proposed model on highly competitive multiple-choice video QA benchmarks: TVQA, TVQA+, and DramaQA. Experimental results show that our model achieves state-of-the-art performance on all three datasets, and further analyses validate our approaches.
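The masked-anchor contrastive objective described above can be read as an InfoNCE-style loss in which the masked ground-truth input is the anchor, its unmasked version is the positive, and the other answer candidates are negatives. The PyTorch snippet below is a minimal sketch of that reading, not the authors' released code; the encoder outputs, the number of negatives, and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_anchor_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style auxiliary loss (sketch): pull the unmasked ground-truth
    answer embedding toward the embedding of its masked version, while pushing
    the remaining answer candidates away.

    anchor:    (B, D)    encoder output for the masked ground-truth input
    positive:  (B, D)    encoder output for the original ground-truth input
    negatives: (B, K, D) encoder outputs for the other answer candidates
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives) / temperature  # (B, K)

    logits = torch.cat([pos_logit, neg_logits], dim=1)  # positive sits at index 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```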
Related papers
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
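As a rough illustration of CLIP-score-guided frame selection (assuming the Hugging Face CLIP implementation; the checkpoint, the query text, and k are placeholders, and VaQuitA's actual ranking procedure may differ):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def top_k_frames_by_clip_score(frames, query, k=8,
                               model_name="openai/clip-vit-base-patch32"):
    """Score each decoded frame (PIL.Image) against a text query with CLIP
    and keep the k highest-scoring frames in temporal order."""
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,)

    keep = scores.topk(min(k, len(frames))).indices.sort().values
    return [frames[i] for i in keep.tolist()]
```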
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation [2.6519061087638014]
We propose a novel approach that complements self-supervised pretraining with an auxiliary pretraining phase based on knowledge similarity distillation (auxSKD).
Our method deploys a teacher network that iteratively distils its knowledge to the student model by capturing the similarity information between segments of unlabelled video data.
We also introduce a novel pretext task, Video Segment Pace Prediction or VSPP, which requires our model to predict the playback speed of a randomly selected segment of the input video to provide more reliable self-supervised representations.
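A minimal sketch of such a pace-prediction pretext task follows; the segment sampling scheme, candidate rates, and prediction head here are assumptions, not the exact auxSKD/VSPP design:

```python
import random
import torch.nn as nn

PACE_RATES = [1, 2, 4, 8]  # assumed candidate playback speeds; one class per rate

def sample_pace_clip(video, clip_len=16):
    """Cut a random segment from a video tensor (T, C, H, W), subsample it at a
    random rate, and return (clip, pace_label); assumes T >= clip_len * max rate."""
    label = random.randrange(len(PACE_RATES))
    rate = PACE_RATES[label]
    start = random.randint(0, video.shape[0] - clip_len * rate)
    clip = video[start:start + clip_len * rate:rate]  # (clip_len, C, H, W)
    return clip, label

class PaceHead(nn.Module):
    """Linear classifier over the backbone's clip embedding, trained with
    cross-entropy against the sampled pace label."""
    def __init__(self, feat_dim, num_rates=len(PACE_RATES)):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_rates)

    def forward(self, clip_features):   # (B, feat_dim)
        return self.fc(clip_features)   # (B, num_rates) pace logits
```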
arXiv Detail & Related papers (2021-12-07T21:50:40Z)
- CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning [49.18591896085498]
We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
arXiv Detail & Related papers (2021-04-01T06:42:16Z)
- A Hierarchical Reasoning Graph Neural Network for The Automatic Scoring of Answer Transcriptions in Video Job Interviews [14.091472037847499]
We propose a Hierarchical Reasoning Graph Neural Network (HRGNN) for the automatic assessment of question-answer pairs.
We employ a semantic-level reasoning graph attention network to model the interaction states of the current QA session.
Finally, we propose a gated recurrent unit encoder to represent the temporal question-answer pairs for the final prediction.
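The described pipeline (interaction modelling over sentence representations followed by a temporal GRU over QA pairs) could be sketched roughly as below; plain self-attention stands in here for the paper's semantic-level graph attention network, and all dimensions are placeholders:

```python
import torch.nn as nn

class InterviewQAScorer(nn.Module):
    """Toy version of the described pipeline: model interactions among the
    sentence representations of each QA pair (self-attention as a stand-in for
    graph attention), then run a GRU over the QA-pair sequence and score it."""
    def __init__(self, dim=256):
        super().__init__()
        self.interaction = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.scorer = nn.Linear(dim, 1)

    def forward(self, sentence_nodes):
        # sentence_nodes: (num_qa_pairs, num_sentences, dim) for one interview
        ctx, _ = self.interaction(sentence_nodes, sentence_nodes, sentence_nodes)
        qa_seq = ctx.mean(dim=1).unsqueeze(0)    # (1, num_qa_pairs, dim)
        _, h_n = self.gru(qa_seq)                # final hidden state: (1, 1, dim)
        return self.scorer(h_n[-1]).squeeze(-1)  # scalar assessment score
```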
arXiv Detail & Related papers (2020-12-22T12:27:45Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)