Masked Motion Encoding for Self-Supervised Video Representation Learning
- URL: http://arxiv.org/abs/2210.06096v2
- Date: Thu, 23 Mar 2023 05:50:55 GMT
- Title: Masked Motion Encoding for Self-Supervised Video Representation Learning
- Authors: Xinyu Sun, Peihao Chen, Liangwei Chen, Changhao Li, Thomas H. Li,
Mingkui Tan and Chuang Gan
- Abstract summary: We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
- Score: 84.24773072241945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to learn discriminative video representation from unlabeled videos is
challenging but crucial for video analysis. The latest attempts seek to learn a
representation model by predicting the appearance contents in the masked
regions. However, simply masking and recovering appearance contents may not be
sufficient to model temporal clues as the appearance contents can be easily
reconstructed from a single frame. To overcome this limitation, we present
Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs
both appearance and motion information to explore temporal clues. In MME, we
focus on addressing two critical challenges to improve the representation
performance: 1) how to well represent the possible long-term motion across
multiple frames; and 2) how to obtain fine-grained temporal clues from sparsely
sampled videos. Motivated by the fact that humans are able to recognize an
action by tracking objects' position changes and shape changes, we propose to
reconstruct a motion trajectory that represents these two kinds of change in
the masked regions. In addition, given the sparse video input, we force the model
to reconstruct dense motion trajectories in both spatial and temporal
dimensions. Pre-trained with our MME paradigm, the model is able to anticipate
long-term and fine-grained motion details. Code is available at
https://github.com/XinyuSun/MME.
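To make the pre-training objective concrete, below is a minimal sketch of an MME-style training step: spatio-temporal patches are masked, the encoder sees only the visible patches, and a decoder regresses both pixel contents and precomputed motion-trajectory targets (position and shape changes) for the masked positions. This is not the authors' implementation (see the repository above); the module structure, patch sizes, and names such as MaskedMotionModel and trajectory_targets are illustrative assumptions.
```python
# Hypothetical sketch of an MME-style pre-training step (PyTorch), not the authors' code.
# Shapes, patch sizes, and target definitions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedMotionModel(nn.Module):
    def __init__(self, embed_dim=768, num_patches=1568, tube_dim=3 * 2 * 16 * 16, traj_dim=64):
        super().__init__()
        self.patch_embed = nn.Linear(tube_dim, embed_dim)      # embed 2x16x16 RGB tubes
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        enc_layer = nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)
        dec_layer = nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=4)
        self.appearance_head = nn.Linear(embed_dim, tube_dim)  # recover masked pixels
        self.motion_head = nn.Linear(embed_dim, traj_dim)      # recover masked trajectories

    def forward(self, tubes, mask, pixel_targets, trajectory_targets):
        # tubes:              (B, N, tube_dim)  flattened spatio-temporal patches
        # mask:               (B, N) bool, True where a patch is masked (same count per sample)
        # pixel_targets:      (B, N, tube_dim)  normalized pixel values per patch
        # trajectory_targets: (B, N, traj_dim)  precomputed position/shape-change features
        B, N, _ = tubes.shape
        x = self.patch_embed(tubes) + self.pos_embed
        visible = x[~mask].view(B, -1, x.size(-1))              # encode visible patches only
        latent = self.encoder(visible)
        # Re-insert learnable mask tokens at the masked positions before decoding.
        full = self.mask_token.expand(B, N, -1).clone()
        full[~mask] = latent.reshape(-1, latent.size(-1))
        decoded = self.decoder(full + self.pos_embed)
        # Both losses are computed on the masked positions only.
        loss_app = F.mse_loss(self.appearance_head(decoded)[mask], pixel_targets[mask])
        loss_mot = F.mse_loss(self.motion_head(decoded)[mask], trajectory_targets[mask])
        return loss_app + loss_mot


# Usage with random tensors (90% of patches masked):
model = MaskedMotionModel()
B, N = 2, 1568
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, : int(0.9 * N)] = True
loss = model(torch.randn(B, N, 3 * 2 * 16 * 16), mask,
             torch.randn(B, N, 3 * 2 * 16 * 16), torch.randn(B, N, 64))
loss.backward()
```
The only difference from an appearance-only masked video autoencoder in this sketch is the extra motion head and trajectory targets; how the paper actually defines and densifies those trajectories is described in the full text.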
Related papers
- Self-supervised Amodal Video Object Segmentation [57.929357732733926]
Amodal perception requires inferring the full shape of an object that is partially occluded.
This paper develops a new framework for self-supervised amodal video object segmentation (SaVos).
arXiv Detail & Related papers (2022-10-23T14:09:35Z)
- Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders [46.38458873424361]
Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised representation learners.
In this work we present a motion-aware variant -- MotionMAE.
Our model is designed to additionally predict the corresponding motion structure information over time.
arXiv Detail & Related papers (2022-10-09T03:22:15Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
This paper proposes a Motion-Contrastive Perception Network (MCPNet).
MCPNet consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
- Self-supervised Motion Learning from Static Images [36.85209332144106]
Motion from Static Images (MoSI) learns to encode motion information.
We demonstrate that MoSI can discover regions with large motion even without fine-tuning on the downstream datasets.
arXiv Detail & Related papers (2021-04-01T03:55:50Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a single self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)