MGMAE: Motion Guided Masking for Video Masked Autoencoding
- URL: http://arxiv.org/abs/2308.10794v1
- Date: Mon, 21 Aug 2023 15:39:41 GMT
- Title: MGMAE: Motion Guided Masking for Video Masked Autoencoding
- Authors: Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao and Limin Wang
- Abstract summary: Temporal redundancy has led to a high masking ratio and customized masking strategy in VideoMAE.
Our motion guided masking explicitly incorporates motion information to build a temporally consistent masking volume.
We perform experiments on the Something-Something V2 and Kinetics-400 datasets, demonstrating the superior performance of our MGMAE over the original VideoMAE.
- Score: 34.80832206608387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked autoencoding has shown excellent performance on self-supervised video
representation learning. Temporal redundancy has led to a high masking ratio
and customized masking strategy in VideoMAE. In this paper, we aim to further
improve the performance of video masked autoencoding by introducing a motion
guided masking strategy. Our key insight is that motion is a general and unique
prior in video, which should be taken into account during masked pre-training.
Our motion guided masking explicitly incorporates motion information to build
a temporally consistent masking volume. Based on this masking volume, we can
track the unmasked tokens in time and sample a set of temporally consistent
cubes from videos. These temporally aligned unmasked tokens further relieve the
information leakage issue in time and encourage MGMAE to learn more useful
structural information. We implement our MGMAE with an efficient online optical
flow estimator and a backward masking map warping strategy. We perform
experiments on the Something-Something V2 and Kinetics-400 datasets,
demonstrating the superior performance of our MGMAE over the original VideoMAE.
In addition, we provide a visualization analysis to illustrate that our MGMAE
can sample temporally consistent cubes in a motion-adaptive manner for more
effective video pre-training.
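To make the mechanism concrete, below is a minimal, illustrative sketch (not the authors' released implementation) of motion-guided mask propagation: an initial random mask on a base frame is warped to the remaining frames by backward warping with optical flow, producing a temporally consistent masking volume whose visible tokens follow the motion. The function name `propagate_mask`, the tensor shapes, and the random flow used as a stand-in for an online flow estimator are all illustrative assumptions.

```python
# A minimal sketch (not the authors' implementation) of motion-guided mask
# propagation. An initial random mask on a base frame is warped to the other
# frames by backward warping with optical flow, yielding a temporally
# consistent masking volume. `propagate_mask` and its inputs are illustrative.
import torch
import torch.nn.functional as F


def propagate_mask(base_mask: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """base_mask: (H, W) float mask on the base frame (1 = masked).
    flows: (T-1, 2, H, W) flow from each remaining frame back to the base frame,
    with channels ordered (dx, dy). Returns a (T, H, W) masking volume."""
    H, W = base_mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()          # (H, W, 2) in (x, y) order

    volume = [base_mask]
    for flow in flows:
        # Backward warping: each target-frame location samples the base-frame
        # mask at the position the flow points to.
        coords = grid + flow.permute(1, 2, 0)
        coords_x = 2.0 * coords[..., 0] / (W - 1) - 1.0   # rescale to [-1, 1]
        coords_y = 2.0 * coords[..., 1] / (H - 1) - 1.0
        sample_grid = torch.stack((coords_x, coords_y), dim=-1)[None]  # (1, H, W, 2)
        warped = F.grid_sample(
            base_mask[None, None], sample_grid, mode="nearest", align_corners=True
        )[0, 0]
        volume.append(warped)
    return torch.stack(volume)                            # (T, H, W)


if __name__ == "__main__":
    torch.manual_seed(0)
    base = (torch.rand(14, 14) < 0.9).float()             # roughly 90% of tokens masked
    flows = torch.randn(7, 2, 14, 14) * 0.5               # stand-in for estimated flow
    volume = propagate_mask(base, flows)
    print(volume.shape)                                   # torch.Size([8, 14, 14])
```

In a full pipeline, each warped map would be re-binarized (for example by keeping the top-k values per frame) so that every frame retains the same masking ratio, and the random flows above would be replaced by the output of an efficient online optical flow estimator, as described in the abstract.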
Related papers
- FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing [22.876290778155514]
Cross-attention masks are effective in video editing but can introduce artifacts such as blurring and flickering.
We propose FreeMask, a method for selecting optimal masks tailored to specific video editing tasks.
Our approach achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-09-30T17:01:26Z) - Text-Guided Video Masked Autoencoder [12.321239366215426]
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions.
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
arXiv Detail & Related papers (2024-08-01T17:58:19Z) - Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders [89.12558126877532]
We propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE.
Our method exclusively considers pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video.
CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches.
arXiv Detail & Related papers (2024-03-26T16:04:19Z) - Motion-Guided Masking for Spatiotemporal Representation Learning [16.9547105658246]
We propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time.
On two challenging large-scale video benchmarks, we equip video MAE with our MGM and achieve up to +1.3% improvement compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-24T17:58:04Z) - Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z) - Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z) - Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders [46.38458873424361]
Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised representation learners.
In this work we present a motion-aware variant -- MotionMAE.
Our model is designed to additionally predict the corresponding motion structure information over time.
arXiv Detail & Related papers (2022-10-09T03:22:15Z) - Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the Youtube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z) - Masked Autoencoders As Spatiotemporal Learners [60.83955416682043]
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos.
We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.
We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data.
arXiv Detail & Related papers (2022-05-18T17:59:59Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information it presents and is not responsible for any consequences arising from its use.