VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
- URL: http://arxiv.org/abs/2106.11250v1
- Date: Mon, 21 Jun 2021 16:48:19 GMT
- Title: VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
- Authors: Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal
- Abstract summary: Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
- Score: 82.09856883441044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video understanding relies on perceiving the global content and modeling its
internal connections (e.g., causality, movement, and spatio-temporal
correspondence). To learn these interactions, we apply a mask-then-predict
pre-training task on discretized video tokens generated via VQ-VAE. Unlike
language, where the text tokens are more independent, neighboring video tokens
typically have strong correlations (e.g., consecutive video frames usually look
very similar), and hence uniformly masking individual tokens will make the task
too trivial to learn useful representations. To deal with this issue, we
propose a block-wise masking strategy where we mask neighboring video tokens in
both spatial and temporal domains. We also add an augmentation-free contrastive
learning method to further capture the global content by predicting whether the
video clips are sampled from the same video. We pre-train our model on
uncurated videos and show that our pre-trained model can reach state-of-the-art
results on several video understanding datasets (e.g., SSV2, Diving48). Lastly,
we provide detailed analyses on model scalability and pre-training method
design. Code is released at https://github.com/airsplay/vimpac.
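The two pre-training objectives described in the abstract translate into fairly compact code. The sketch below illustrates block-wise masking over a (T, H, W) grid of discrete VQ-VAE token indices; the block-size bounds, mask ratio, codebook size, and the extra [MASK] id are illustrative assumptions, not the settings of the released implementation.

```python
# Minimal sketch of block-wise 3D masking over a (T, H, W) grid of discrete
# video tokens (VQ-VAE indices). Block-size bounds, the mask ratio, the
# codebook size, and the [MASK] id are assumptions, not the released settings.
import torch

def blockwise_mask(t_len, h_len, w_len, mask_ratio=0.5, max_block=(4, 8, 8)):
    """Return a boolean (T, H, W) mask; True marks tokens to be masked.

    Blocks span both the temporal and spatial axes, so a masked token cannot
    be trivially recovered by copying an unmasked spatio-temporal neighbor.
    """
    mask = torch.zeros(t_len, h_len, w_len, dtype=torch.bool)
    target = int(mask_ratio * mask.numel())
    while mask.sum() < target:
        # Sample a block extent along each axis (clamped to the grid size).
        bt = min(torch.randint(1, max_block[0] + 1, (1,)).item(), t_len)
        bh = min(torch.randint(1, max_block[1] + 1, (1,)).item(), h_len)
        bw = min(torch.randint(1, max_block[2] + 1, (1,)).item(), w_len)
        # Sample the block's front/top-left corner so the block fits the grid.
        t0 = torch.randint(0, t_len - bt + 1, (1,)).item()
        h0 = torch.randint(0, h_len - bh + 1, (1,)).item()
        w0 = torch.randint(0, w_len - bw + 1, (1,)).item()
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask

# Usage: mask a small token grid and build the mask-then-predict inputs/targets.
tokens = torch.randint(0, 1024, (8, 16, 16))  # hypothetical VQ-VAE code indices
mask = blockwise_mask(8, 16, 16, mask_ratio=0.5)
MASK_ID = 1024                                # extra "[MASK]" vocabulary entry
inputs = tokens.masked_fill(mask, MASK_ID)    # transformer input
targets = tokens[mask]                        # loss computed only at masked positions
```

The augmentation-free contrastive objective can likewise be sketched as an InfoNCE-style loss in which two clips sampled from the same video form the positive pair and clips from other videos in the batch serve as negatives; the feature shapes and temperature below are assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the augmentation-free contrastive loss: two clips sampled
# from the same video are a positive pair; clips from other videos in the
# batch act as negatives. Feature shapes and the temperature are assumptions.
import torch
import torch.nn.functional as F

def same_video_contrastive(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: (B, D) clip features; row i of each comes from video i."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temperature                         # (B, B) similarity logits
    labels = torch.arange(a.size(0), device=logits.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```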
Related papers
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding.
arXiv Detail & Related papers (2022-11-21T06:48:14Z) - Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.