VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
- URL: http://arxiv.org/abs/2303.16727v2
- Date: Tue, 18 Apr 2023 11:46:41 GMT
- Title: VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
- Authors: Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang,
Yali Wang, Yu Qiao
- Abstract summary: Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a billion-parameter video ViT model that achieves new state-of-the-art performance.
- Score: 57.552798046137646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scale is the primary factor for building a powerful foundation model that
generalizes well to a variety of downstream tasks. However, it is still
challenging to train video foundation models with billions of parameters. This
paper shows that video masked autoencoder (VideoMAE) is a scalable and general
self-supervised pre-trainer for building video foundation models. We scale the
VideoMAE in both model and data with a core design. Specifically, we present a
dual masking strategy for efficient pre-training, with an encoder operating on
a subset of video tokens and a decoder processing another subset of video
tokens; a minimal sketch of this dual masking follows the abstract. Although
VideoMAE is already very efficient thanks to the high masking ratio in the
encoder, masking the decoder further reduces the overall computational cost.
This enables efficient pre-training of billion-parameter models on video.
We also use a progressive training paradigm that involves an initial
pre-training on a diverse multi-sourced unlabeled dataset, followed by a
post-pre-training on a mixed labeled dataset. Finally, we successfully train a
video ViT model with a billion parameters, which achieves new state-of-the-art
performance on Kinetics (90.0% on K400 and 89.9% on K600) and
Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively
verify the pre-trained video ViT models on a variety of downstream tasks,
demonstrating their effectiveness as general video representation learners.
The code and models are available at https://github.com/OpenGVLab/VideoMAEv2.
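A minimal sketch of the dual masking idea from the abstract, assuming purely random token selection, illustrative masking ratios (90% of tokens hidden from the encoder, half of the masked tokens used as decoder targets), and a toy single-layer encoder; the released implementation uses structured masking and a full ViT, so treat the names and numbers below as assumptions, not the paper's code.

```python
# Sketch of dual masking for a video MAE-style pre-trainer (illustrative only).
import torch
import torch.nn as nn


def dual_mask_split(tokens, encoder_keep_ratio=0.1, decoder_keep_ratio=0.5):
    """Pick a small visible subset for the encoder and, from the remaining
    masked tokens, a subset of reconstruction targets for the decoder, so the
    decoder never has to process every masked token."""
    B, N, C = tokens.shape
    perm = torch.rand(B, N).argsort(dim=1)                  # random token order per clip
    n_vis = int(N * encoder_keep_ratio)                     # tokens the encoder sees
    vis_idx = perm[:, :n_vis]
    masked_idx = perm[:, n_vis:]                            # tokens the encoder skips
    n_dec = int(masked_idx.shape[1] * decoder_keep_ratio)   # decoder reconstruction targets
    dec_idx = masked_idx[:, :n_dec]

    def gather(idx):
        return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))

    return gather(vis_idx), gather(dec_idx)


if __name__ == "__main__":
    # 8 temporal slices x 14 x 14 spatial patches = 1568 tokens per clip, dim 768 (illustrative).
    clips = torch.randn(2, 1568, 768)
    visible, targets = dual_mask_split(clips)
    encoder = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    latent = encoder(visible)             # encoder runs on only ~10% of the tokens
    print(latent.shape, targets.shape)    # decoder would regress only `targets`
```

Because the encoder sees roughly 10% of the tokens and the decoder reconstructs only a subset of the masked ones, the per-clip compute stays small even as the backbone grows, which is the property the abstract relies on for billion-parameter pre-training.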
Related papers
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z)
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performances, which are comparable to some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels, using an asymmetric encoder-decoder architecture and a high masking ratio.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
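For reference, the random patch masking described in the MAE entry above can be sketched in a few lines; the 16x16 patch size, 75% masking ratio, and helper name are assumptions for illustration, not the paper's released code.

```python
# Illustrative sketch of MAE-style random patch masking on images.
import torch


def random_patch_mask(images, patch=16, mask_ratio=0.75):
    """Patchify images, keep a random 25% of patches for the encoder, and
    return the kept patches plus the indices of the masked ones (the
    reconstruction targets)."""
    B, C, H, W = images.shape
    # (B, C, H/p, W/p, p, p) -> (B, num_patches, C*p*p)
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    N = patches.shape[1]
    perm = torch.rand(B, N).argsort(dim=1)          # random patch order per image
    n_keep = int(N * (1 - mask_ratio))
    keep_idx, mask_idx = perm[:, :n_keep], perm[:, n_keep:]
    kept = patches.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return kept, mask_idx


kept, masked = random_patch_mask(torch.randn(4, 3, 224, 224))
print(kept.shape, masked.shape)   # (4, 49, 768) visible patches, (4, 147) masked indices
```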