AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with
Masked Autoencoders
- URL: http://arxiv.org/abs/2211.09120v1
- Date: Wed, 16 Nov 2022 18:59:48 GMT
- Title: AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with
Masked Autoencoders
- Authors: Wele Gedara Chaminda Bandara, Naman Patel, Ali Gholami, Mehdi Nikkhah,
Motilal Agrawal, Vishal M. Patel
- Abstract summary: Masked Autoencoders learn general representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data.
This paper proposes an adaptive masking strategy for MAEs that is end-to-end trainable.
AdaMAE samples visible tokens based on the semantic context using an auxiliary sampling network.
- Score: 44.87786478095987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Autoencoders (MAEs) learn generalizable representations for image,
text, audio, video, etc., by reconstructing masked input data from tokens of
the visible data. Current MAE approaches for videos rely on random patch, tube,
or frame-based masking strategies to select these tokens. This paper proposes
AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our
adaptive masking strategy samples visible tokens based on the semantic context
using an auxiliary sampling network. This network estimates a categorical
distribution over spacetime-patch tokens. The tokens that increase the expected
reconstruction error are rewarded and selected as visible tokens, motivated by
the policy gradient algorithm in reinforcement learning. We show that AdaMAE
samples more tokens from the high spatiotemporal information regions, thereby
allowing us to mask 95% of tokens, resulting in lower memory requirements and
faster pre-training. We conduct ablation studies on the Something-Something v2
(SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach
and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on
SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone
and 800 pre-training epochs.
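The abstract describes the core mechanism precisely enough to sketch: an auxiliary network scores every spacetime-patch token, visible tokens are sampled from the resulting categorical distribution, and a policy-gradient (REINFORCE-style) loss rewards tokens from high-reconstruction-error regions. Below is a minimal PyTorch sketch of that idea; it is not the authors' code, and the layer sizes, the Gumbel top-k sampling trick, and the per-token reward wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplerNet(nn.Module):
    """Auxiliary sampling network: scores each spacetime-patch token
    (a single linear head here; the real architecture is an assumption)."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens):                  # tokens: (B, N, dim)
        return self.score(tokens).squeeze(-1)   # logits of a categorical dist over N tokens

def sample_visible(logits, num_visible):
    # Gumbel top-k approximates sampling without replacement from the
    # categorical distribution defined by the logits.
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-9) + 1e-9)
    return torch.topk(logits + gumbel, num_visible, dim=-1).indices

def sampler_loss(logits, visible_idx, token_recon_error):
    # REINFORCE-style objective: the reward is the reconstruction error
    # around each sampled token (detached so no gradient reaches the MAE),
    # pushing the sampler toward high-information regions.
    log_p = F.log_softmax(logits, dim=-1).gather(1, visible_idx)
    reward = token_recon_error.gather(1, visible_idx).detach()
    return -(reward * log_p).mean()

# 95% masking from the abstract: keep only 5% of tokens visible.
B, N, D = 2, 1568, 768        # 1568 = 8 x 14 x 14 spacetime patches (illustrative)
tokens = torch.randn(B, N, D)
logits = SamplerNet(D)(tokens)
visible_idx = sample_visible(logits, num_visible=int(0.05 * N))
```

Keeping only 5% of the tokens visible is what yields the lower memory footprint and faster pre-training reported in the abstract.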
Related papers
- Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks.
Mask classification is the main performance bottleneck for open-vocabulary panoptic segmentation.
We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocabulary panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
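A plausible minimal sketch of the noise-filtering idea described here (an assumption, not the paper's code): low-pass filtering white noise and thresholding it yields spatially clustered binary masks that are independent of the input data; ColorMAE explores several such filtered-noise patterns.

```python
import torch
import torch.nn.functional as F

def noise_mask(grid=14, mask_ratio=0.75, kernel=5):
    """Data-independent mask from filtered random noise
    (the box filter is an illustrative low-pass choice)."""
    noise = torch.rand(1, 1, grid, grid)                # white noise
    box = torch.ones(1, 1, kernel, kernel) / kernel**2
    smooth = F.conv2d(noise, box, padding=kernel // 2)  # low-pass filter
    k = int(mask_ratio * grid * grid)
    idx = smooth.flatten().topk(k).indices              # mask the k largest values
    mask = torch.zeros(grid * grid, dtype=torch.bool)
    mask[idx] = True
    return mask.reshape(grid, grid)                     # True = masked patch
```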
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization [42.82742477950748]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that learns an optimal masking strategy during pretraining.
Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning.
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
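The key trick named in this summary, Gumbel-Softmax, is what keeps the mask generator differentiable. A minimal illustration follows; in AutoMAE the per-patch logits would come from the adversarially trained generator, simplified here to random values.

```python
import torch
import torch.nn.functional as F

# Per-patch (keep, mask) logits; 196 = 14 x 14 patches for a 224px image.
logits = torch.randn(196, 2, requires_grad=True)
onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # straight-through sample
mask = onehot[:, 1]  # 1.0 = masked patch; gradients still flow to the logits
```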
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Data Efficient Masked Language Modeling for Vision and Language [16.95631509102115]
Masked language modeling (MLM) is one of the key sub-tasks in vision-language training.
In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text.
We investigate a range of alternative masking strategies specific to the cross-modal setting that address the shortcomings of random masking.
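For reference, the random-masking baseline this summary describes is straightforward; a generic sketch (the 15% rate is the BERT default, an assumption here, and the paper studies alternatives to exactly this strategy):

```python
import torch

def random_mlm_mask(token_ids, mask_id, p=0.15):
    """Mask ~p of the text tokens uniformly at random; the model then
    predicts the originals at masked positions given image and text."""
    mask = torch.rand(token_ids.shape) < p
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id
    return corrupted, mask
```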
arXiv Detail & Related papers (2021-09-05T11:27:53Z)
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
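A minimal sketch of the block-wise masking described here (block sizes are illustrative assumptions): masking a contiguous spatiotemporal block prevents the model from trivially copying answers from adjacent visible tokens.

```python
import torch

def block_mask(T=8, H=14, W=14, bt=2, bh=4, bw=4):
    """Mask one contiguous spatiotemporal block of video tokens."""
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    t0 = torch.randint(0, T - bt + 1, (1,)).item()
    h0 = torch.randint(0, H - bh + 1, (1,)).item()
    w0 = torch.randint(0, W - bw + 1, (1,)).item()
    mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True  # True = masked token
    return mask
```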
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
- MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP).
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves 76.9% top-1 accuracy with DeiT-S under linear evaluation, using only 300 epochs of pre-training.
arXiv Detail & Related papers (2021-06-10T11:05:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.