Related papers: Symmetric masking strategy enhances the performance of Masked Image Modeling

Symmetric masking strategy enhances the performance of Masked Image Modeling

URL: http://arxiv.org/abs/2408.12772v1
Date: Fri, 23 Aug 2024 00:15:43 GMT
Title: Symmetric masking strategy enhances the performance of Masked Image Modeling
Authors: Khanh-Binh Nguyen, Chae Jung Park,
Abstract summary: Masked Image Modeling (MIM) is a technique in self-supervised learning that focuses on acquiring detailed visual representations from unlabeled images. We propose a new masking strategy that effectively helps the model capture global and local features. Based on this masking strategy, SymMIM, our proposed training pipeline for MIM is introduced.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Masked Image Modeling (MIM) is a technique in self-supervised learning that focuses on acquiring detailed visual representations from unlabeled images by estimating the missing pixels in randomly masked sections. It has proven to be a powerful tool for the preliminary training of Vision Transformers (ViTs), yielding impressive results across various tasks. Nevertheless, most MIM methods heavily depend on the random masking strategy to formulate the pretext task. This strategy necessitates numerous trials to ascertain the optimal dropping ratio, which can be resource-intensive, requiring the model to be pre-trained for anywhere between 800 to 1600 epochs. Furthermore, this approach may not be suitable for all datasets. In this work, we propose a new masking strategy that effectively helps the model capture global and local features. Based on this masking strategy, SymMIM, our proposed training pipeline for MIM is introduced. SymMIM achieves a new SOTA accuracy of 85.9\% on ImageNet using ViT-Large and surpasses previous SOTA across downstream tasks such as image classification, semantic segmentation, object detection, instance segmentation tasks, and so on.

Related papers

Evolved Hierarchical Masking for Self-Supervised Learning [49.77271430882176]
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning.
arXiv Detail & Related papers (2025-04-12T09:40:14Z)
Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks. mask classification is the main performance bottleneck for open-vocab panoptic segmentation. We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocab panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework. We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise. We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training [33.39585710223628]
Saliency-Based Adaptive Masking improves pre-training performance of MIM approaches by prioritizing token salience. We show that our method significantly improves over the state-of-the-art in mask-based pre-training on the ImageNet-1K dataset.
arXiv Detail & Related papers (2024-04-12T08:38:51Z)
MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks. We propose a single-stage and standalone method, MOCA, which unifies both desired properties. We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data. We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process. In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction. We propose a remarkably simple and effective method, ourmethod, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision. We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency. EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-02-28T09:21:12Z)
Exploring the Coordination of Frequency and Attention in Masked Image Modeling [28.418445136155512]
Masked image modeling (MIM) has dominated self-supervised learning in computer vision. We propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches. FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works.
arXiv Detail & Related papers (2022-11-28T14:38:19Z)
Adversarial Masking for Self-Supervised Learning [81.25999058340997]
Masked image model (MIM) framework for self-supervised learning, ADIOS, is proposed. It simultaneously learns a masking function and an image encoder using an adversarial objective. It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
arXiv Detail & Related papers (2022-01-31T10:23:23Z)
MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image. MST achieves Top-1 accuracy of 76.9% with DeiT-S only using 300-epoch pre-training by linear evaluation.
arXiv Detail & Related papers (2021-06-10T11:05:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.