AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with
Masked Autoencoders
- URL: http://arxiv.org/abs/2211.09120v1
- Date: Wed, 16 Nov 2022 18:59:48 GMT
- Title: AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with
Masked Autoencoders
- Authors: Wele Gedara Chaminda Bandara, Naman Patel, Ali Gholami, Mehdi Nikkhah,
Motilal Agrawal, Vishal M. Patel
- Abstract summary: Masked Autoencoders learn general representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data.
This paper proposes an adaptive masking strategy for MAEs that is end-to-end trainable.
AdaMAE samples visible tokens based on the semantic context using an auxiliary sampling network.
- Score: 44.87786478095987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Autoencoders (MAEs) learn generalizable representations for image,
text, audio, video, etc., by reconstructing masked input data from tokens of
the visible data. Current MAE approaches for videos rely on random patch, tube,
or frame-based masking strategies to select these tokens. This paper proposes
AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our
adaptive masking strategy samples visible tokens based on the semantic context
using an auxiliary sampling network. This network estimates a categorical
distribution over spacetime-patch tokens. The tokens that increase the expected
reconstruction error are rewarded and selected as visible tokens, motivated by
the policy gradient algorithm in reinforcement learning. We show that AdaMAE
samples more tokens from the high spatiotemporal information regions, thereby
allowing us to mask 95% of tokens, resulting in lower memory requirements and
faster pre-training. We conduct ablation studies on the Something-Something v2
(SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach
and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on
SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone
and 800 pre-training epochs.
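The abstract describes the core mechanism precisely enough to sketch: an auxiliary network scores every spacetime-patch token, visible tokens are sampled from the resulting categorical distribution, and a policy-gradient (REINFORCE-style) loss rewards tokens from high-reconstruction-error regions. Below is a minimal PyTorch sketch of that idea; it is not the authors' code, and the layer sizes, the Gumbel top-k sampling trick, and the per-token reward wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplerNet(nn.Module):
    """Auxiliary sampling network: scores each spacetime-patch token
    (a single linear head here; the real architecture is an assumption)."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens):                  # tokens: (B, N, dim)
        return self.score(tokens).squeeze(-1)   # logits of a categorical dist over N tokens

def sample_visible(logits, num_visible):
    # Gumbel top-k approximates sampling without replacement from the
    # categorical distribution defined by the logits.
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-9) + 1e-9)
    return torch.topk(logits + gumbel, num_visible, dim=-1).indices

def sampler_loss(logits, visible_idx, token_recon_error):
    # REINFORCE-style objective: the reward is the reconstruction error
    # around each sampled token (detached so no gradient reaches the MAE),
    # pushing the sampler toward high-information regions.
    log_p = F.log_softmax(logits, dim=-1).gather(1, visible_idx)
    reward = token_recon_error.gather(1, visible_idx).detach()
    return -(reward * log_p).mean()

# 95% masking from the abstract: keep only 5% of tokens visible.
B, N, D = 2, 1568, 768        # 1568 = 8 x 14 x 14 spacetime patches (illustrative)
tokens = torch.randn(B, N, D)
logits = SamplerNet(D)(tokens)
visible_idx = sample_visible(logits, num_visible=int(0.05 * N))
```

Keeping only 5% of the tokens visible is what yields the lower memory footprint and faster pre-training reported in the abstract.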
Related papers
- Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks.
Mask classification is the main performance bottleneck for open-vocabulary panoptic segmentation.
We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocabulary panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
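A plausible minimal sketch of the noise-filtering idea described here (an assumption, not the paper's code): low-pass filtering white noise and thresholding it yields spatially clustered binary masks that are independent of the input data; ColorMAE explores several such filtered-noise patterns.

```python
import torch
import torch.nn.functional as F

def noise_mask(grid=14, mask_ratio=0.75, kernel=5):
    """Data-independent mask from filtered random noise
    (the box filter is an illustrative low-pass choice)."""
    noise = torch.rand(1, 1, grid, grid)                # white noise
    box = torch.ones(1, 1, kernel, kernel) / kernel**2
    smooth = F.conv2d(noise, box, padding=kernel // 2)  # low-pass filter
    k = int(mask_ratio * grid * grid)
    idx = smooth.flatten().topk(k).indices              # mask the k largest values
    mask = torch.zeros(grid * grid, dtype=torch.bool)
    mask[idx] = True
    return mask.reshape(grid, grid)                     # True = masked patch
```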
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization [42.82742477950748]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that learns an optimal masking strategy during pretraining.
Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning.
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
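The key trick named in this summary, Gumbel-Softmax, is what keeps the mask generator differentiable. A minimal illustration follows; in AutoMAE the per-patch logits would come from the adversarially trained generator, simplified here to random values.

```python
import torch
import torch.nn.functional as F

# Per-patch (keep, mask) logits; 196 = 14 x 14 patches for a 224px image.
logits = torch.randn(196, 2, requires_grad=True)
onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # straight-through sample
mask = onehot[:, 1]  # 1.0 = masked patch; gradients still flow to the logits
```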
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Data Efficient Masked Language Modeling for Vision and Language [16.95631509102115]
Masked language modeling (MLM) is one of the key sub-tasks in vision-language training.
In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text.
We investigate a range of alternative masking strategies specific to the cross-modal setting that address the shortcomings of random masking.
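For reference, the random-masking baseline this summary describes is straightforward; a generic sketch (the 15% rate is the BERT default, an assumption here, and the paper studies alternatives to exactly this strategy):

```python
import torch

def random_mlm_mask(token_ids, mask_id, p=0.15):
    """Mask ~p of the text tokens uniformly at random; the model then
    predicts the originals at masked positions given image and text."""
    mask = torch.rand(token_ids.shape) < p
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id
    return corrupted, mask
```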
arXiv Detail & Related papers (2021-09-05T11:27:53Z)
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
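A minimal sketch of the block-wise masking described here (block sizes are illustrative assumptions): masking a contiguous spatiotemporal block prevents the model from trivially copying answers from adjacent visible tokens.

```python
import torch

def block_mask(T=8, H=14, W=14, bt=2, bh=4, bw=4):
    """Mask one contiguous spatiotemporal block of video tokens."""
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    t0 = torch.randint(0, T - bt + 1, (1,)).item()
    h0 = torch.randint(0, H - bh + 1, (1,)).item()
    w0 = torch.randint(0, W - bw + 1, (1,)).item()
    mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True  # True = masked token
    return mask
```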
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
- MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP).
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves 76.9% top-1 accuracy with DeiT-S under linear evaluation, using only 300 epochs of pre-training.
arXiv Detail & Related papers (2021-06-10T11:05:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.