Masked Video Distillation: Rethinking Masked Feature Modeling for
Self-supervised Video Representation Learning
- URL: http://arxiv.org/abs/2212.04500v1
- Date: Thu, 8 Dec 2022 18:59:59 GMT
- Title: Masked Video Distillation: Rethinking Masked Feature Modeling for
Self-supervised Video Representation Learning
- Authors: Rui Wang and Dongdong Chen and Zuxuan Wu and Yinpeng Chen and Xiyang
Dai and Mengchen Liu and Lu Yuan and Yu-Gang Jiang
- Abstract summary: Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
- Score: 123.63301596019522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benefiting from masked visual modeling, self-supervised video representation
learning has achieved remarkable progress. However, existing methods focus on
learning representations from scratch through reconstructing low-level features
like raw pixel RGB values. In this paper, we propose masked video distillation
(MVD), a simple yet effective two-stage masked feature modeling framework for
video representation learning: firstly we pretrain an image (or video) model by
recovering low-level features of masked patches, then we use the resulting
features as targets for masked feature modeling. For the choice of teacher
models, we observe that students taught by video teachers perform better on
temporally-heavy video tasks, while image teachers transfer stronger spatial
representations for spatially-heavy video tasks. Visualization analysis also
indicates different teachers produce different learned patterns for students.
Motivated by this observation, to leverage the advantage of different teachers,
we design a spatial-temporal co-teaching method for MVD. Specifically, we
distill student models from both video teachers and image teachers by masked
feature modeling. Extensive experimental results demonstrate that video
transformers pretrained with spatial-temporal co-teaching outperform models
distilled with a single teacher on a multitude of video datasets. Our MVD with
vanilla ViT achieves state-of-the-art performance compared with previous
supervised or self-supervised methods on several challenging video downstream
tasks. For example, with the ViT-Large model, our MVD achieves 86.4% and 75.9%
Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming
VideoMAE by 1.2% and 1.6% respectively. Code will be available at
\url{https://github.com/ruiwang2021/mvd}.
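To make the co-teaching objective concrete, below is a minimal sketch of the stage-2 distillation step, assuming PyTorch. The tiny linear modules stand in for the student backbone, its per-teacher prediction heads, and the frozen stage-1 image and video teachers; the masking scheme, the MSE loss, and names such as co_teaching_loss are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def co_teaching_loss(pred_img, pred_vid, tgt_img, tgt_vid, mask):
    """Regress student predictions toward both teachers' features,
    averaging the per-token loss over masked positions only (mask == 1)."""
    def masked_mse(pred, tgt):
        per_token = F.mse_loss(pred, tgt, reduction="none").mean(dim=-1)  # [B, N]
        return (per_token * mask).sum() / mask.sum().clamp(min=1)
    return masked_mse(pred_img, tgt_img) + masked_mse(pred_vid, tgt_vid)

B, N, D = 2, 1568, 768                    # batch, tokens (e.g. 8x14x14 tubelets), dim
tokens = torch.randn(B, N, D)             # stand-in for video patch embeddings
mask = (torch.rand(B, N) < 0.9).float()   # high masking ratio, 1 = masked

student = nn.Linear(D, D)                 # placeholder for the student ViT
head_img = nn.Linear(D, D)                # prediction head toward the image teacher
head_vid = nn.Linear(D, D)                # prediction head toward the video teacher
image_teacher = nn.Linear(D, D).eval()    # frozen stage-1 image teacher
video_teacher = nn.Linear(D, D).eval()    # frozen stage-1 video teacher

with torch.no_grad():                     # teachers only provide targets
    tgt_img = image_teacher(tokens)
    tgt_vid = video_teacher(tokens)

feats = student(tokens * (1 - mask).unsqueeze(-1))  # crude input masking for illustration
loss = co_teaching_loss(head_img(feats), head_vid(feats), tgt_img, tgt_vid, mask)
loss.backward()

In this two-stage recipe only the student and its heads receive gradients; the teachers are themselves pretrained in stage 1 with low-level masked reconstruction and then kept frozen.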
Related papers
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders [11.727612242016871]
ViC-MAE is a model that combines Masked AutoEncoders (MAE) and contrastive learning.
We show that visual representations learned under ViC-MAE generalize well to both video and image classification tasks.
arXiv Detail & Related papers (2023-03-21T16:33:40Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
- Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders [46.38458873424361]
Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised representation learners.
In this work we present a motion-aware variant -- MotionMAE.
Our model is designed to additionally predict the corresponding motion structure information over time.
arXiv Detail & Related papers (2022-10-09T03:22:15Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE (a minimal EMA-update sketch follows this list).
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- BEVT: BERT Pretraining of Video Transformers [89.08460834954161]
We introduce BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
arXiv Detail & Related papers (2021-12-02T18:59:59Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
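As a companion to the RC-MAE entry above, here is a minimal sketch of the exponential-moving-average (EMA) teacher update that mean-teacher style self-distillation adds on top of a student network, assuming PyTorch; the toy two-layer model and the momentum value are illustrative assumptions, not settings from the paper.

import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
teacher = copy.deepcopy(student)      # the teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)           # the teacher is never updated by gradients

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """teacher <- momentum * teacher + (1 - momentum) * student, per parameter."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

# Call once after every optimizer step on the student:
ema_update(teacher, student)

In RC-MAE the teacher is used to provide a reconstruction-consistency signal alongside the usual MAE objective; the sketch above covers only the teacher-update mechanics.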