CUPID: Adaptive Curation of Pre-training Data for Video-and-Language
Representation Learning
- URL: http://arxiv.org/abs/2104.00285v1
- Date: Thu, 1 Apr 2021 06:42:16 GMT
- Title: CUPID: Adaptive Curation of Pre-training Data for Video-and-Language
Representation Learning
- Authors: Luowei Zhou, Jingjing Liu, Yu Cheng, Zhe Gan, Lei Zhang
- Abstract summary: We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
- Score: 49.18591896085498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work concerns video-language pre-training and representation learning.
In this now ubiquitous training scheme, a model first performs pre-training on
paired videos and text (e.g., video clips and accompanied subtitles) from a
large uncurated source corpus, before transferring to specific downstream
tasks. This two-stage training process inevitably raises questions about the
generalization ability of the pre-trained model, which is particularly
pronounced when a salient domain gap exists between source and target data
(e.g., instructional cooking videos vs. movies). In this paper, we first bring
to light the sensitivity of pre-training objectives (contrastive vs.
reconstructive) to domain discrepancy. Then, we propose a simple yet effective
framework, CUPID, to bridge this domain gap by filtering and adapting source
data to the target data, followed by domain-focused pre-training. Comprehensive
experiments demonstrate that pre-training on a considerably smaller subset of
domain-focused data can effectively close the source-target domain gap and
achieve significant performance gain, compared to random sampling or even
exploiting the full pre-training dataset. CUPID yields new state-of-the-art
performance across multiple video-language and video tasks, including
text-to-video retrieval [72, 37], video question answering [36], and video
captioning [72], with consistent performance lift over different pre-training
methods.
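As a rough illustration of the curation step described above, the sketch below ranks source clips by the cosine similarity of their embeddings to a target-domain centroid and keeps the top-k for domain-focused pre-training. It assumes clip embeddings from some pre-trained encoder are already available; the function and variable names are illustrative and not taken from the paper.

```python
# Hypothetical sketch of similarity-based data curation: keep the source clips
# whose embeddings lie closest to the target-domain centroid.
import torch
import torch.nn.functional as F

def select_domain_subset(source_emb: torch.Tensor,
                         target_emb: torch.Tensor,
                         k: int) -> torch.Tensor:
    """Return indices of the k source clips closest to the target domain."""
    source_emb = F.normalize(source_emb, dim=-1)                    # (N_src, D)
    target_centroid = F.normalize(target_emb.mean(dim=0), dim=-1)   # (D,)
    similarity = source_emb @ target_centroid                       # (N_src,)
    return similarity.topk(k).indices

# Random embeddings stand in for real encoder outputs.
src = torch.randn(10_000, 512)   # uncurated source corpus
tgt = torch.randn(200, 512)      # small sample from the target domain
subset_idx = select_domain_subset(src, tgt, k=1_000)
print(subset_idx.shape)          # torch.Size([1000])
```

The selected subset would then be used for continued pre-training with the model's usual contrastive or reconstructive objectives before fine-tuning on the target task.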
Related papers
- Learning from One Continuous Video Stream [70.30084026960819]
We introduce a framework for online learning from a single continuous video stream.
This poses great challenges given the high correlation between consecutive video frames.
We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation.
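One hedged reading of the pixel-to-pixel modelling mentioned above is a per-pixel prediction objective on the stream, e.g. predicting the next frame from the current one. The toy network, shapes, and next-frame target below are assumptions for illustration, not the paper's actual setup.

```python
# Minimal sketch of a per-pixel prediction objective on a video stream
# (assumed next-frame prediction; not the paper's exact objective).
import torch
import torch.nn as nn

class PixelPredictor(nn.Module):
    """Toy model that maps the current frame to a predicted next frame."""
    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)               # same spatial shape as the input

model = PixelPredictor()
frame_t = torch.rand(1, 3, 64, 64)           # current frame from the stream
frame_t1 = torch.rand(1, 3, 64, 64)          # next frame (prediction target)
loss = nn.functional.mse_loss(model(frame_t), frame_t1)
loss.backward()                              # one online update step would follow
```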
arXiv Detail & Related papers (2023-12-01T14:03:30Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- Hierarchical Self-supervised Representation Learning for Movie Understanding [24.952866206036536]
We propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model.
Specifically, we propose to pretrain the low-level video backbone using a contrastive learning objective, while pretraining the higher-level video contextualizer using an event mask prediction task.
We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics on the VidSitu benchmark [37] (e.g., improving semantic role prediction from 47% to 61% CIDEr score).
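A minimal sketch of an event-mask-prediction objective of the kind described above for the higher-level contextualizer: replace one clip-level feature with a learned mask token, contextualize the sequence with a Transformer encoder, and reconstruct the masked feature. All module and variable names are hypothetical, not the authors' implementation.

```python
# Sketch of event mask prediction over clip-level features (assumed setup).
import torch
import torch.nn as nn

dim, n_events = 256, 8
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
mask_token = nn.Parameter(torch.zeros(dim))        # learned [MASK] embedding

events = torch.randn(1, n_events, dim)             # clip features from a low-level backbone
masked = events.clone()
idx = torch.randint(n_events, (1,)).item()         # choose one event to mask out
masked[:, idx] = mask_token

context = encoder(masked)                          # contextualized event sequence
loss = nn.functional.mse_loss(context[:, idx], events[:, idx])  # reconstruct the masked event
loss.backward()
```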
arXiv Detail & Related papers (2022-04-06T21:28:41Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
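A hedged sketch of this pattern: keep frame-wise features from a frozen image-text encoder fixed and train only a lightweight temporal Transformer on top of them. The feature dimension, depth, and mean pooling below are assumptions, not the paper's exact design.

```python
# Lightweight temporal Transformer over frozen frame-wise features (assumed design).
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    """Small temporal encoder applied on top of per-frame features."""
    def __init__(self, dim: int = 512, layers: int = 2, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) from a frozen image encoder
        return self.encoder(frame_feats).mean(dim=1)   # pooled video embedding

frame_feats = torch.randn(4, 16, 512)     # e.g. per-frame features for 16 frames
video_emb = TemporalHead()(frame_feats)   # (4, 512), usable for video-text matching
```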
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation [2.6519061087638014]
We propose a novel approach to complement self-supervised pretraining via an auxiliary pretraining phase, based on knowledge similarity distillation, auxSKD.
Our method deploys a teacher network that iteratively distils its knowledge to the student model by capturing the similarity information between segments of unlabelled video data.
We also introduce a novel pretext task, Video Segment Pace Prediction or VSPP, which requires our model to predict the playback speed of a randomly selected segment of the input video to provide more reliable self-supervised representations.
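The sketch below illustrates a segment pace-prediction pretext task in the spirit of VSPP: subsample one segment of a clip at a randomly chosen rate and train a classifier to recover that rate. The toy encoder, candidate rates, and tensor shapes are placeholders, not the authors' architecture.

```python
# Toy playback-speed (pace) prediction pretext task (assumed setup).
import torch
import torch.nn as nn

rates = [1, 2, 4]                                  # candidate playback speeds
backbone = nn.Sequential(nn.Flatten(2), nn.Linear(3 * 32 * 32, 128))  # toy encoder
head = nn.Linear(128, len(rates))                  # speed classifier

clip = torch.rand(1, 64, 3, 32, 32)                # (batch, frames, C, H, W)
label = torch.randint(len(rates), (1,))            # which speed was applied
segment = clip[:, :: rates[label.item()]][:, :8]   # resampled 8-frame segment

feats = backbone(segment).mean(dim=1)              # (1, 128) pooled features
loss = nn.functional.cross_entropy(head(feats), label)
loss.backward()
```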
arXiv Detail & Related papers (2021-12-07T21:50:40Z)
- Unsupervised Domain Adaptation for Video Semantic Segmentation [91.30558794056054]
Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to real.
In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation.
We show that our proposals significantly outperform previous image-based UDA methods both on image-level (mIoU) and video-level (VPQ) evaluation metrics.
arXiv Detail & Related papers (2021-07-23T07:18:20Z)
- Don't Stop Pretraining: Adapt Language Models to Domains and Tasks [81.99843216550306]
We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks.
A second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains.
Adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining.
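A hedged sketch of this recipe in the language-model setting: continue masked-language-model pre-training on unlabeled in-domain (or task) text before fine-tuning, here with the Hugging Face transformers API. The two-sentence corpus and the roberta-base checkpoint are placeholders for real in-domain data and the model of interest.

```python
# Minimal domain-adaptive pretraining step: continued MLM training on in-domain text.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

domain_corpus = ["In-domain sentence one.", "In-domain sentence two."]  # placeholder data
encodings = [tokenizer(t, truncation=True) for t in domain_corpus]
batch = collator(encodings)                        # randomly masked inputs + labels

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(**batch).loss                         # MLM loss on in-domain text
loss.backward()
optimizer.step()
```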
arXiv Detail & Related papers (2020-04-23T04:21:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.