Masked Autoencoders As Spatiotemporal Learners
- URL: http://arxiv.org/abs/2205.09113v1
- Date: Wed, 18 May 2022 17:59:59 GMT
- Title: Masked Autoencoders As Spatiotemporal Learners
- Authors: Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
- Abstract summary: This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos.
We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.
We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data.
- Score: 60.83955416682043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies a conceptually simple extension of Masked Autoencoders
(MAE) to spatiotemporal representation learning from videos. We randomly mask
out spacetime patches in videos and learn an autoencoder to reconstruct them in
pixels. Interestingly, we show that our MAE method can learn strong
representations with almost no inductive bias on spacetime (except for
patch and positional embeddings), and spacetime-agnostic random masking
performs the best. We observe that the optimal masking ratio is as high as 90%
(vs. 75% on images), supporting the hypothesis that this ratio is related to
information redundancy of the data. A high masking ratio leads to a large
speedup, e.g., > 4x in wall-clock time or even more. We report competitive
results on several challenging video datasets using vanilla Vision
Transformers. We observe that MAE can outperform supervised pre-training by
large margins. We further report encouraging results of training on real-world,
uncurated Instagram data. Our study suggests that the general framework of
masked autoencoding (BERT, MAE, etc.) can be a unified methodology for
representation learning with minimal domain knowledge.
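A minimal sketch of the recipe described above: tokenize a clip into spacetime patches and keep only a random ~10% of them (spacetime-agnostic masking at a 90% ratio). The patch size, function name, and use of NumPy are illustrative assumptions rather than the authors' implementation, which uses vanilla Vision Transformers; following the standard MAE design, only the visible patches would be fed to the encoder, which is where the reported >4x wall-clock speedup comes from.

```python
# Illustrative sketch of spacetime-agnostic random masking (not the authors' code).
import numpy as np

def random_spacetime_masking(video, patch=(2, 16, 16), mask_ratio=0.90, rng=None):
    """Split a (T, H, W, C) clip into spacetime patches and randomly keep a
    small visible subset; indices of masked patches are returned so a decoder
    could reconstruct them in pixels."""
    rng = rng or np.random.default_rng(0)
    T, H, W, C = video.shape
    pt, ph, pw = patch
    # Tokenize into (T/pt * H/ph * W/pw) patches, each flattened to pt*ph*pw*C values.
    tokens = (video
              .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
              .transpose(0, 2, 4, 1, 3, 5, 6)
              .reshape(-1, pt * ph * pw * C))
    n = tokens.shape[0]
    n_keep = int(round(n * (1 - mask_ratio)))   # e.g. keep 10% of all patches
    perm = rng.permutation(n)                   # uniform over all spacetime positions
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return tokens[keep_idx], keep_idx, mask_idx

# Example: a 16-frame 224x224 RGB clip -> 8*14*14 = 1568 patches, 157 visible.
clip = np.zeros((16, 224, 224, 3), dtype=np.float32)
visible, keep_idx, mask_idx = random_spacetime_masking(clip)
print(visible.shape)  # (157, 1536)
```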
Related papers
- Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization [42.82742477950748]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that learns an optimal masking strategy during pretraining.
Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning.
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- Motion-Guided Masking for Spatiotemporal Representation Learning [16.9547105658246]
We propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time.
On two challenging large-scale video benchmarks, we equip video MAE with our MGM and achieve up to a +1.3% improvement compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-24T17:58:04Z)
- MGMAE: Motion Guided Masking for Video Masked Autoencoding [34.80832206608387]
Temporal redundancy has led to a high masking ratio and customized masking strategy in VideoMAE.
Our motion guided masking explicitly incorporates motion information to build a temporally consistent masking volume.
We perform experiments on the Something-Something V2 and Kinetics-400 datasets, demonstrating the superior performance of our MGMAE over the original VideoMAE.
arXiv Detail & Related papers (2023-08-21T15:39:41Z)
- DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
arXiv Detail & Related papers (2023-04-02T16:40:42Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs (an asymmetric encoder-decoder and a high 75% masking ratio) enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains (a rough sketch appears after this list).
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
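For contrast with the spacetime-agnostic masking used in the main paper, below is a rough sketch of the block-wise masking idea from the VIMPAC entry above: contiguous blocks of neighboring tokens are masked in both space and time. The grid size, block size, target ratio, and function name are illustrative assumptions, not values from that paper.

```python
# Hedged sketch of block-wise spatiotemporal masking (illustrative values only).
import numpy as np

def blockwise_spacetime_mask(grid=(8, 14, 14), block=(2, 4, 4),
                             target_ratio=0.5, rng=None):
    """Return a boolean mask over a (T, H, W) token grid, where True marks a
    masked token; random blocks are placed until the target ratio is reached."""
    rng = rng or np.random.default_rng(0)
    T, H, W = grid
    bt, bh, bw = block
    mask = np.zeros(grid, dtype=bool)
    while mask.mean() < target_ratio:
        # Random top-left-front corner of a block that fits inside the grid.
        t0 = rng.integers(0, T - bt + 1)
        h0 = rng.integers(0, H - bh + 1)
        w0 = rng.integers(0, W - bw + 1)
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask

m = blockwise_spacetime_mask()
print(m.shape, round(float(m.mean()), 2))  # (8, 14, 14) and roughly the target ratio
```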