Self-supervised Video Representation Learning with Motion-Aware Masked
Autoencoders
- URL: http://arxiv.org/abs/2210.04154v1
- Date: Sun, 9 Oct 2022 03:22:15 GMT
- Title: Self-supervised Video Representation Learning with Motion-Aware Masked
Autoencoders
- Authors: Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang,
Xiatian Zhu, Zehuan Yuan
- Abstract summary: Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised representation learners.
In this work we present a motion-aware variant -- MotionMAE.
Our model is designed to additionally predict the corresponding motion structure information over time.
- Score: 46.38458873424361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked autoencoders (MAEs) have recently emerged as state-of-the-art
self-supervised spatiotemporal representation learners. Inherited from their image
counterparts, however, existing video MAEs still focus largely on static appearance
learning while remaining limited in capturing dynamic temporal information, and are
hence less effective for video downstream tasks. To resolve this drawback, in this
work we present a motion-aware variant -- MotionMAE. Apart from learning to
reconstruct individual masked patches of video frames, our model is designed to
additionally predict the corresponding motion structure information over time. This
motion information is readily available as the temporal difference of nearby frames.
As a result, our model can effectively extract both static appearance and dynamic
motion, leading to superior spatiotemporal representation learning capability.
Extensive experiments show that our MotionMAE significantly outperforms both the
supervised learning baseline and state-of-the-art MAE alternatives, under both
domain-specific and domain-generic pretraining-then-finetuning settings. In
particular, when using ViT-B as the backbone, our MotionMAE surpasses the prior art
by a margin of 1.2% on Something-Something V2 and 3.2% on UCF101 in the
domain-specific pretraining setting. Encouragingly, it also surpasses the competing
MAEs by a large margin of over 3% on the challenging video object segmentation task.
The code is available at https://github.com/happy-hsy/MotionMAE.
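As a rough, minimal sketch of the core idea (not the authors' implementation; see the
linked repository for that), the motion target can be formed as the temporal
difference of nearby frames and predicted alongside the usual masked-patch
reconstruction. The `encoder`, `app_head`, and `motion_head` modules below are
hypothetical placeholders, and the masking/alignment details are simplified.

```python
# Minimal sketch of a MotionMAE-style training signal: reconstruct masked patches
# and additionally predict frame differences (motion) at the corresponding positions.
# encoder / app_head / motion_head are hypothetical placeholder modules.
import torch.nn.functional as F


def patchify(frames, p):
    """Split frames (B, T, C, H, W) into flattened non-overlapping p x p patches."""
    B, T, C, H, W = frames.shape
    x = frames.reshape(B, T, C, H // p, p, W // p, p)
    x = x.permute(0, 1, 3, 5, 2, 4, 6)
    return x.reshape(B, T * (H // p) * (W // p), C * p * p)


def motion_mae_loss(encoder, app_head, motion_head, video, mask, p=16):
    """video: (B, T, C, H, W); mask: bool (B, L) over patch tokens, True = masked."""
    # Appearance target: the raw frames, patchified.
    app_tgt = patchify(video, p)                            # (B, T*N, P)
    # Motion target: temporal difference of nearby frames, patchified.
    mot_tgt = patchify(video[:, 1:] - video[:, :-1], p)     # (B, (T-1)*N, P)

    latent = encoder(video, mask)        # encode visible tokens only
    app_pred = app_head(latent)          # predict pixels of masked patches
    mot_pred = motion_head(latent)       # predict frame-difference patches

    loss_app = F.mse_loss(app_pred[mask], app_tgt[mask])
    # Motion targets exist only for T-1 frame gaps; align the mask accordingly.
    m = mask[:, : mot_tgt.shape[1]]
    loss_mot = F.mse_loss(mot_pred[:, : mot_tgt.shape[1]][m], mot_tgt[m])
    return loss_app + loss_mot
```

The point of the sketch is only that pretraining is driven by both an appearance
reconstruction loss and a temporal-difference (motion) loss; the actual masking
strategy and head design follow the paper and repository.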
Related papers
- DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking
Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
arXiv Detail & Related papers (2023-04-02T16:40:42Z)
- Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning [123.63301596019522]
Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
arXiv Detail & Related papers (2022-12-08T18:59:59Z)
- Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet).
MCPNet consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a single self-supervised task that models both motion and appearance features well.
We propose to perceive playback speed and exploit the relative speed between two video clips as labels (a toy sketch of this idea follows the list).
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
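As a rough illustration of the relative playback-speed idea summarized above for
RSPNet, the sketch below samples two clips from the same video at different frame
strides and labels the pair by which clip plays back faster. The stride set, clip
length, and 3-way label convention are illustrative assumptions, not the paper's
exact formulation.

```python
# Toy sketch of a relative playback-speed pretext label: sample two clips from one
# video at different frame strides and label which clip is faster.
# Stride choices and the label convention here are illustrative assumptions.
import random


def sample_clip(num_frames, clip_len, stride):
    """Return frame indices for a clip of clip_len frames taken every `stride` frames."""
    max_start = max(num_frames - clip_len * stride, 0)
    start = random.randint(0, max_start)
    return [start + i * stride for i in range(clip_len)]


def relative_speed_pair(num_frames, clip_len=16, strides=(1, 2, 4)):
    s1, s2 = random.choice(strides), random.choice(strides)
    clip1 = sample_clip(num_frames, clip_len, s1)
    clip2 = sample_clip(num_frames, clip_len, s2)
    # 0 = same speed, 1 = first clip faster (larger stride), 2 = second clip faster.
    label = 0 if s1 == s2 else (1 if s1 > s2 else 2)
    return clip1, clip2, label


idx1, idx2, speed_label = relative_speed_pair(num_frames=300)
print(speed_label)  # e.g. 1 if the first clip was sampled with the larger stride
```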