SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders
- URL: http://arxiv.org/abs/2206.10207v1
- Date: Tue, 21 Jun 2022 09:08:32 GMT
- Title: SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders
- Authors: Gang Li, Heliang Zheng, Daqing Liu, Bing Su, Changwen Zheng
- Abstract summary: Unlike words in NLP, images lack a natural semantic decomposition, which still makes masked autoencoding (MAE) different between vision and language.
We propose a Semantic-Guided Masking strategy to integrate semantic information into the training process of MAE.
- Score: 24.73294590182861
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, significant progress has been made in masked image modeling to
catch up to masked language modeling. However, unlike words in NLP, the lack of
semantic decomposition of images still makes masked autoencoding (MAE)
different between vision and language. In this paper, we explore a potential
visual analogue of words, i.e., semantic parts, and we integrate semantic
information into the training process of MAE by proposing a Semantic-Guided
Masking strategy. Compared to widely adopted random masking, our masking
strategy can gradually guide the network to learn various information, i.e.,
from intra-part patterns to inter-part relations. In particular, we achieve
this in two steps. 1) Semantic part learning: we design a self-supervised part
learning method to obtain semantic parts by leveraging and refining the
multi-head attention of a ViT-based encoder. 2) Semantic-guided MAE (SemMAE)
training: we design a masking strategy that varies from masking a portion of
patches in each part to masking a portion of (whole) parts in an image.
Extensive experiments on various vision tasks show that SemMAE can learn better
image representation by integrating semantic information. In particular, SemMAE
achieves 84.5% fine-tuning accuracy on ImageNet-1k, which outperforms the
vanilla MAE by 1.4%. In the semantic segmentation and fine-grained recognition
tasks, SemMAE also brings significant improvements and yields the
state-of-the-art performance.
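The two-step recipe above maps onto a short sketch. The snippet below illustrates only the masking schedule of step 2, assuming per-patch part assignments (`part_ids`) already come from the step-1 part learner; the hard switch at t = 0.5 stands in for the paper's gradual transition, and all names are illustrative rather than taken from the authors' code.

```python
import torch


def semantic_guided_mask(part_ids: torch.Tensor, mask_ratio: float, t: float) -> torch.Tensor:
    """part_ids: (N,) part index per patch; t in [0, 1] is training progress.

    Returns a boolean mask of shape (N,), True = patch is masked.
    """
    n = part_ids.numel()
    mask = torch.zeros(n, dtype=torch.bool)
    parts = part_ids.unique()

    if t < 0.5:
        # Early stage: mask `mask_ratio` of the patches inside every part,
        # pushing the model to complete intra-part patterns.
        for p in parts:
            idx = (part_ids == p).nonzero(as_tuple=True)[0]
            k = max(1, int(mask_ratio * idx.numel()))
            mask[idx[torch.randperm(idx.numel())[:k]]] = True
    else:
        # Late stage: mask whole parts until ~mask_ratio of all patches are
        # hidden, pushing the model to reason about inter-part relations.
        budget = int(mask_ratio * n)
        for p in parts[torch.randperm(parts.numel())]:
            idx = (part_ids == p).nonzero(as_tuple=True)[0]
            if mask.any() and mask.sum() + idx.numel() > budget:
                break
            mask[idx] = True
    return mask


# Example: 196 patches (14x14 grid), 6 pseudo-parts, 75% mask ratio.
part_ids = torch.randint(0, 6, (196,))
print(semantic_guided_mask(part_ids, 0.75, t=0.1).sum())  # intra-part masking
print(semantic_guided_mask(part_ids, 0.75, t=0.9).sum())  # whole-part masking
```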
Related papers
- CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With this simple and elegant design, CtxMIM encourages the pre-trained model to learn object-level or pixel-level features from a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate the data-hungry needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies the desirable properties of contrastive and masked-image-modeling approaches.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
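As a rough illustration of predicting masked online codebook assignments: a teacher network assigns each patch of the full view to its nearest codebook entry, and the student is trained to predict those assignments for the patches it sees masked. The nearest-neighbour codebook and random features below are stand-in assumptions, not MOCA's actual teacher or update rule.

```python
import torch
import torch.nn.functional as F

codebook = torch.randn(512, 256)        # 512 prototypes of dim 256 (assumed sizes)
teacher_feats = torch.randn(196, 256)   # teacher features on the full view
student_logits = torch.randn(196, 512, requires_grad=True)  # student on the masked view

# Teacher target: hard assignment of each patch to its closest codebook entry.
targets = torch.cdist(teacher_feats, codebook).argmin(dim=-1)  # (196,)

mask = torch.rand(196) < 0.75           # patches hidden from the student
loss = F.cross_entropy(student_logits[mask], targets[mask])
loss.backward()
```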
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder, as sketched below.
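A minimal sketch of the soft-masking idea, assuming region features and a word embedding of matching dimension; the scaled dot-product attention and the linear attenuation rule are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def soft_mask_regions(regions: torch.Tensor, word: torch.Tensor, strength: float = 0.5) -> torch.Tensor:
    """regions: (R, D) visual region features; word: (D,) one word embedding."""
    attn = F.softmax(regions @ word / regions.shape[-1] ** 0.5, dim=0)  # (R,) relevance
    # Attenuate each region in proportion to its word-conditional relevance,
    # yielding a softly masked (rather than hard-dropped) image view.
    scale = 1.0 - strength * (attn / attn.max())
    return regions * scale.unsqueeze(-1)


regions = torch.randn(49, 256)  # e.g., a 7x7 grid of region features
word = torch.randn(256)
masked_view = soft_mask_regions(regions, word)  # one diverse feature set for ITM
```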
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
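The Gumbel-Softmax link can be sketched as follows: per-patch logits from a (hypothetical) mask generator are sampled into a hard 0/1 mask whose gradients still flow back, so the generator can be trained adversarially against the image-modeling loss. The two-logit parameterization below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F


def sample_mask(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """logits: (N, 2) per-patch scores for [keep, mask].

    Returns an (N,) hard 0/1 mask with straight-through gradients.
    """
    y = F.gumbel_softmax(logits, tau=tau, hard=True)  # (N, 2), one-hot rows
    return y[:, 1]  # 1.0 where the patch is masked


logits = torch.randn(196, 2, requires_grad=True)  # would come from the mask generator
mask = sample_mask(logits)
mask.sum().backward()  # gradients reach the generator's logits
```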
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? [26.146459754995597]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
arXiv Detail & Related papers (2022-10-20T17:59:54Z) - Learning Hierarchical Image Segmentation For Recognition and By Recognition [39.712584686731574]
We propose to integrate a hierarchical segmenter into the recognition process, and to train and adapt the entire model solely on image-level recognition objectives.
We learn hierarchical segmentation for free alongside recognition, automatically uncovering part-to-whole relationships that not only underpin but also enhance recognition.
Notably, our model (trained on 1M unlabeled ImageNet images) outperforms SAM (trained on 11M images and 1B masks) by an absolute 8% in mIoU on PartImageNet object segmentation.
arXiv Detail & Related papers (2022-10-01T16:31:44Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
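A hedged sketch of masked self-distillation: an EMA teacher encodes the full input, the student encodes a masked view, and the student's features are pulled toward the teacher's at the masked positions. The linear encoders, zero-masking, and MSE target below are stand-ins; MaskCLIP pairs this objective with the contrastive language-image loss.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Linear(768, 256)  # hypothetical patch encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

patches = torch.randn(196, 768)
mask = torch.rand(196) < 0.6

with torch.no_grad():
    target = teacher(patches)                    # teacher sees everything
pred = student(patches * (~mask).unsqueeze(-1))  # student sees a masked view
loss = F.mse_loss(pred[mask], target[mask])
loss.backward()

with torch.no_grad():  # EMA update of the teacher
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.996).add_(ps, alpha=0.004)
```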
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - Masked Vision and Language Modeling for Multi-modal Representation Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help from another modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performance when using a large amount of data.
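One way to picture the cross-modal reconstruction is a pair of cross-attention passes in which each modality queries the other for the content it is missing; the module below is a hypothetical stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn


class JointMaskedVL(nn.Module):
    """Recover masked patches with help from text, and masked tokens with help from the image."""

    def __init__(self, dim: int = 256, vocab: int = 30522, patch_dim: int = 768):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.pix_head = nn.Linear(dim, patch_dim)  # regress masked patch pixels
        self.tok_head = nn.Linear(dim, vocab)      # classify masked word tokens

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        img_ctx, _ = self.cross(img_feats, txt_feats, txt_feats)  # image queries text
        txt_ctx, _ = self.cross(txt_feats, img_feats, img_feats)  # text queries image
        return self.pix_head(img_ctx), self.tok_head(txt_ctx)


model = JointMaskedVL()
pix, tok = model(torch.randn(2, 196, 256), torch.randn(2, 32, 256))
print(pix.shape, tok.shape)  # (2, 196, 768), (2, 32, 30522)
```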
arXiv Detail & Related papers (2022-08-03T15:11:01Z) - Adversarial Masking for Self-Supervised Learning [81.25999058340997]
ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
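The adversarial objective admits a compact sketch: one optimizer step teaches the encoder to solve the task under the current masks, the next teaches the masker to make that task harder. The pixel-space autoencoder and MSE loss below stand in for the SSL objectives ADIOS actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins: a tiny autoencoder and a network that outputs a
# per-pixel occlusion map in [0, 1].
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
masker = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())
opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-3)
opt_mask = torch.optim.Adam(masker.parameters(), lr=1e-3)

images = torch.randn(8, 3, 32, 32)

# 1) Encoder step: reconstruct the image from its masked version.
mask = masker(images).detach()
loss = F.mse_loss(encoder(images * (1 - mask)), images)
opt_enc.zero_grad(); loss.backward(); opt_enc.step()

# 2) Masker step: choose occlusions that maximize the encoder's loss.
mask = masker(images)
adv_loss = -F.mse_loss(encoder(images * (1 - mask)), images)
opt_mask.zero_grad(); adv_loss.backward(); opt_mask.step()
```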
arXiv Detail & Related papers (2022-01-31T10:23:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.