Motion-Guided Masking for Spatiotemporal Representation Learning
- URL: http://arxiv.org/abs/2308.12962v1
- Date: Thu, 24 Aug 2023 17:58:04 GMT
- Title: Motion-Guided Masking for Spatiotemporal Representation Learning
- Authors: David Fan, Jue Wang, Shuai Liao, Yi Zhu, Vimal Bhat, Hector
Santos-Villalobos, Rohith MV, Xinyu Li
- Abstract summary: We propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time.
On two challenging large-scale video benchmarks, we equip video MAE with our MGM and achieve up to +1.3% improvement compared to previous state-of-the-art methods.
- Score: 16.9547105658246
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Several recent works have directly extended the image masked autoencoder
(MAE) with random masking into the video domain, achieving promising results.
However, unlike images, both spatial and temporal information are important for
video understanding. This suggests that the random masking strategy that is
inherited from the image MAE is less effective for video MAE. This motivates
the design of a novel masking algorithm that can more efficiently make use of
video saliency. Specifically, we propose a motion-guided masking algorithm
(MGM) which leverages motion vectors to guide the position of each mask over
time. Crucially, these motion-based correspondences can be directly obtained
from information stored in the compressed format of the video, which makes our
method efficient and scalable. On two challenging large-scale video benchmarks
(Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and
achieve up to +$1.3\%$ improvement compared to previous state-of-the-art
methods. Additionally, our MGM achieves equivalent performance to previous
video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM
generalizes better to downstream transfer learning and domain adaptation tasks
on the UCF101, HMDB51, and Diving48 datasets, achieving up to +$4.9\%$
improvement compared to baseline methods.
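To make the idea concrete, below is a minimal NumPy sketch of motion-guided masking, not the authors' implementation: it assumes per-patch motion vectors are available as a (T, H, W, 2) array in patch units (in practice they can be pooled from the macroblock motion vectors already stored in the compressed bitstream), and it propagates a random frame-0 mask along those vectors so the mask tracks moving content over time. The function name, array shapes, and the random top-up step are illustrative choices.

```python
# Minimal NumPy sketch of motion-guided masking (illustrative, not the authors' code).
# Assumptions: the clip is tokenized into a T x H x W patch grid, and per-patch
# motion vectors (e.g., pooled from the codec's macroblock motion vectors) are
# provided in patch units as an array of shape (T, H, W, 2).
import numpy as np

def motion_guided_mask(motion, mask_ratio=0.75, seed=0):
    """Propagate a random frame-0 patch mask through time along motion vectors.

    motion:     (T, H, W, 2) per-patch (dy, dx) displacements, in patch units.
    mask_ratio: fraction of patches masked in every frame.
    Returns a boolean (T, H, W) array where True marks a masked patch.
    """
    T, H, W, _ = motion.shape
    rng = np.random.default_rng(seed)
    n_masked = int(round(mask_ratio * H * W))

    # Frame 0: pick masked patch coordinates at random.
    flat = rng.choice(H * W, size=n_masked, replace=False)
    ys, xs = np.divmod(flat, W)

    mask = np.zeros((T, H, W), dtype=bool)
    mask[0, ys, xs] = True
    for t in range(1, T):
        # Shift each masked patch by the motion observed at its current position,
        # so the mask follows moving content instead of staying fixed in space.
        dy = motion[t - 1, ys, xs, 0]
        dx = motion[t - 1, ys, xs, 1]
        ys = np.clip(np.round(ys + dy).astype(int), 0, H - 1)
        xs = np.clip(np.round(xs + dx).astype(int), 0, W - 1)
        mask[t, ys, xs] = True
        # Collisions after rounding/clipping can drop the ratio slightly;
        # top up with extra random patches so every frame keeps mask_ratio.
        deficit = n_masked - int(mask[t].sum())
        if deficit > 0:
            free = np.flatnonzero(~mask[t])
            extra = rng.choice(free, size=deficit, replace=False)
            mask[t][np.unravel_index(extra, (H, W))] = True
    return mask

# Example: a 16-frame clip on a 14x14 patch grid with synthetic motion.
mv = np.random.default_rng(1).normal(scale=0.5, size=(16, 14, 14, 2))
m = motion_guided_mask(mv, mask_ratio=0.75)
print(m.shape, m.mean(axis=(1, 2)))  # per-frame masking ratio stays ~0.75
```

Because the displacements come from the compressed bitstream rather than an optical-flow network, generating such masks adds little cost on top of decoding, which is what makes the approach efficient and scalable.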
Related papers
- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders [89.12558126877532]
We propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE.
Our method exclusively considers pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video.
CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches.
arXiv Detail & Related papers (2024-03-26T16:04:19Z) - MGMAE: Motion Guided Masking for Video Masked Autoencoding [34.80832206608387]
Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE.
Our motion-guided masking explicitly incorporates motion information to build a temporally consistent masking volume.
We perform experiments on the datasets of Something-Something V2 and Kinetics-400, demonstrating the superior performance of our MGMAE to the original VideoMAE.
arXiv Detail & Related papers (2023-08-21T15:39:41Z) - DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking
Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
arXiv Detail & Related papers (2023-04-02T16:40:42Z) - It Takes Two: Masked Appearance-Motion Modeling for Self-supervised
Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z) - Self-supervised Video Representation Learning with Motion-Aware Masked
Autoencoders [46.38458873424361]
Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised representation learners.
In this work we present a motion-aware variant -- MotionMAE.
Our model is designed to additionally predict the corresponding motion structure information over time.
arXiv Detail & Related papers (2022-10-09T03:22:15Z) - Masked Autoencoders As Spatiotemporal Learners [60.83955416682043]
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos.
We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.
We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data.
arXiv Detail & Related papers (2022-05-18T17:59:59Z) - Self-Supervised Video Object Segmentation by Motion-Aware Mask
Propagation [52.8407961172098]
We propose MAMP, a self-supervised motion-aware matching method for semi-supervised video object segmentation.
We show that MAMP achieves state-of-the-art performance with stronger generalization ability compared to existing self-supervised methods.
arXiv Detail & Related papers (2021-07-27T03:07:56Z) - Space-Time Crop & Attend: Improving Cross-modal Video Representation
Learning [88.71867887257274]
We show that spatial augmentations such as cropping are effective for videos too, but that previous implementations could not apply them at a sufficient scale.
To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space.
Second, we show that, as opposed to naive average pooling, the use of transformer-based attention improves performance significantly.
arXiv Detail & Related papers (2021-03-18T12:32:24Z)