VideoMAE: Masked Autoencoders are Data-Efficient Learners for
Self-Supervised Video Pre-Training
- URL: http://arxiv.org/abs/2203.12602v1
- Date: Wed, 23 Mar 2022 17:55:10 GMT
- Title: VideoMAE: Masked Autoencoders are Data-Efficient Learners for
Self-Supervised Video Pre-Training
- Authors: Zhan Tong, Yibing Song, Jue Wang, Limin Wang
- Abstract summary: We show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP).
We are inspired by the recent ImageMAE and propose customized video tube masking and reconstruction.
Our VideoMAE with the vanilla ViT backbone can achieve 83.9% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data.
- Score: 49.68815656405452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training video transformers on extra large-scale datasets is generally
required to achieve premier performance on relatively small datasets. In this
paper, we show that video masked autoencoders (VideoMAE) are data-efficient
learners for self-supervised video pre-training (SSVP). We are inspired by the
recent ImageMAE and propose customized video tube masking and reconstruction.
These simple designs turn out to be effective for overcoming information
leakage caused by the temporal correlation during video reconstruction. We
obtain three important findings on SSVP: (1) An extremely high masking ratio
(i.e., 90% to 95%) still yields favorable VideoMAE performance; the temporal
redundancy of video content permits a higher masking ratio than for images.
(2) VideoMAE achieves impressive results on very small datasets (i.e., around
3k-4k videos) without using any extra data, partly because the challenging
video reconstruction task enforces high-level structure learning. (3) VideoMAE
shows that data quality is more important than data quantity for SSVP; domain
shift between the pre-training and target datasets is an important issue in
SSVP. Notably, our VideoMAE with the vanilla ViT backbone achieves 83.9% on
Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on
HMDB51 without using any
extra data. Code will be released at https://github.com/MCG-NJU/VideoMAE.
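The tube masking described in the abstract is straightforward to sketch: sample one random spatial mask over the patch grid and repeat it along the temporal axis, so a patch hidden in one frame stays hidden in every frame. This is what blocks the information leakage mentioned above, since a masked patch cannot be trivially recovered by copying it from a temporally adjacent frame. Below is a minimal illustrative sketch in NumPy; the shapes and names (e.g. `patches_per_frame`) are assumptions for illustration, not taken from the official VideoMAE code.

```python
import numpy as np

def tube_mask(num_frames: int, patches_per_frame: int,
              mask_ratio: float = 0.9, seed: int | None = None) -> np.ndarray:
    """Tube masking: one random spatial mask shared by every frame.

    Returns a boolean array of shape (num_frames, patches_per_frame);
    True marks a patch that is hidden from the encoder.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * patches_per_frame))
    spatial_mask = np.zeros(patches_per_frame, dtype=bool)
    # Pick the masked spatial positions once for the whole clip.
    spatial_mask[rng.choice(patches_per_frame, num_masked, replace=False)] = True
    # Repeating the mask across time turns each masked patch into a "tube".
    return np.broadcast_to(spatial_mask, (num_frames, patches_per_frame)).copy()

# Example: 8 temporal positions, a 14x14 patch grid, 90% masking.
mask = tube_mask(num_frames=8, patches_per_frame=14 * 14, mask_ratio=0.9)
print(mask.mean())              # ~0.9: only ~10% of tokens reach the encoder
print((mask == mask[0]).all())  # True: identical spatial mask in every frame
```

As in ImageMAE, the reconstruction loss would then be a mean-squared error computed only over the masked tokens; with a 90-95% ratio the encoder processes only a small fraction of the clip, which is also what keeps pre-training cheap.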
Related papers
- ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning [29.620990627792906]
This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order.
Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning.
arXiv Detail & Related papers (2024-05-24T02:29:03Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance; a sketch of the dual-masking idea follows this entry.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
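The dual masking named in the title extends masking to the decoder side: besides the high-ratio mask on the encoder input, the decoder reconstructs only a sampled subset of the masked tokens, so decoding cost no longer grows with the full token count. A rough sketch of that idea, with all names and the 50% decoder ratio chosen for illustration rather than taken from the paper:

```python
import numpy as np

def dual_masks(num_tokens: int, encoder_mask_ratio: float = 0.9,
               decoder_keep_ratio: float = 0.5, seed: int | None = None):
    """Sketch of dual masking: mask the encoder input at a high ratio,
    then sample only a fraction of the masked tokens as decoder targets."""
    rng = np.random.default_rng(seed)
    enc_mask = np.zeros(num_tokens, dtype=bool)
    hidden = rng.choice(num_tokens, int(encoder_mask_ratio * num_tokens),
                        replace=False)
    enc_mask[hidden] = True  # True = hidden from the encoder
    # The decoder reconstructs (and the loss covers) only this subset.
    dec_targets = rng.choice(hidden, int(decoder_keep_ratio * hidden.size),
                             replace=False)
    return enc_mask, dec_targets

enc_mask, dec_targets = dual_masks(num_tokens=1568)  # e.g. 8 x 14 x 14 tokens
print(enc_mask.sum(), dec_targets.size)  # 1411 masked, 705 decoded
```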
- Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and inter-video similarity relationships.
arXiv Detail & Related papers (2022-11-21T06:48:14Z)
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework; an illustrative sketch of a motion target follows this entry.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z)
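The "motion cues as an extra prediction target" can be illustrated with the simplest possible motion signal, a temporal frame difference. This stand-in target is our assumption for illustration; the paper's actual motion target may differ.

```python
import numpy as np

def appearance_and_motion_targets(clip: np.ndarray):
    """Build two regression targets from a clip of shape (T, H, W, C):
    the raw frames (appearance) and their temporal differences (motion),
    a cheap optical-flow-like signal. In a mask-and-predict pipeline,
    both would be predicted only at the masked positions."""
    appearance = clip
    motion = clip[1:] - clip[:-1]  # (T-1, H, W, C) frame differences
    return appearance, motion

clip = np.random.rand(16, 224, 224, 3).astype(np.float32)
appearance, motion = appearance_and_motion_targets(clip)
print(appearance.shape, motion.shape)  # (16, 224, 224, 3) (15, 224, 224, 3)
```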
- Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of these facets for model training, so the learned video representation is biased toward a single facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets and learn a representation that reflects the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z)
- Few-Shot Video Object Detection [70.43402912344327]
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are trained jointly and end-to-end.
arXiv Detail & Related papers (2021-04-30T07:38:04Z)
- Creating a Large-scale Synthetic Dataset for Human Activity Recognition [0.8250374560598496]
We use 3D rendering tools to generate a synthetic dataset of videos, and show that a classifier trained on these videos can generalise to real videos.
We fine-tune a pre-trained I3D model on our videos and find that it achieves a high accuracy of 73% over three classes of the HMDB51 dataset.
arXiv Detail & Related papers (2020-07-21T22:20:21Z)