i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?
- URL: http://arxiv.org/abs/2210.11470v2
- Date: Mon, 8 Apr 2024 22:01:32 GMT
- Title: i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?
- Authors: Kevin Zhang, Zhiqiang Shen
- Abstract summary: Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
- Score: 26.146459754995597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain. However, the mechanism and properties of the representations learned by such a scheme, as well as how to further enhance them, are so far not well explored. In this paper, we aim to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability from two aspects: (1) employing a two-way image reconstruction and a latent feature reconstruction with distillation loss to learn better features; (2) proposing a semantics-enhanced sampling strategy to boost the learned semantics in MAE. Upon the proposed i-MAE architecture, we address two critical questions about the behaviors of the learned representations in MAE: (1) Is the separability of latent representations in Masked Autoencoders helpful for model performance? We study this by forcing the input to be a mixture of two images instead of one. (2) Can we enhance the representations in the latent feature space by controlling the degree of semantics during sampling on Masked Autoencoders? To this end, we propose a sampling strategy within a mini-batch based on the semantics of training samples to examine this aspect. Extensive experiments are conducted on CIFAR-10/100, Tiny-ImageNet and ImageNet-1K to verify the observations we discovered. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space by proposing two evaluation schemes. The surprising and consistent results demonstrate that i-MAE is a superior framework design for understanding MAE frameworks, as well as achieving better representational ability. Code is available at https://github.com/vision-learning-acceleration-lab/i-mae.
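The two ideas the abstract leans on (feeding the encoder a convex mixture of two images, and testing whether latent features are linearly separable) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mixing weight `alpha`, and the least-squares linear probe are illustrative assumptions standing in for whatever mixing ratio and probe classifier the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_images(img_a, img_b, alpha=0.35):
    """Convex mixture of two images; i-MAE-style inputs replace a single
    image with such a blend, with img_a as the dominant component."""
    return (1.0 - alpha) * img_a + alpha * img_b

def linear_probe_accuracy(features, labels, n_classes):
    """Crude linear-separability check: fit a one-vs-all linear classifier
    by least squares on one-hot targets and report training accuracy."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias
    Y = np.eye(n_classes)[labels]                               # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    preds = np.argmax(X @ W, axis=1)
    return float((preds == labels).mean())
```

If the encoder's latent features form well-separated class clusters, the probe accuracy is high; features that the probe cannot separate suggest a latent space with weaker semantics, which is the property the paper's evaluation schemes quantify.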
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation [13.013776924941205]
SemanticMIM is a framework to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation.
We conduct a thorough comparative analysis between CL and MIM, revealing that their complementary advantages stem from two distinct phases, i.e., compression and reconstruction.
We demonstrate that SemanticMIM effectively amalgamates the benefits of CL and MIM, leading to significant enhancement of performance and feature linear separability.
arXiv Detail & Related papers (2024-06-15T15:39:32Z)
- Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction [6.798515070856465]
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD)
Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR)
arXiv Detail & Related papers (2023-11-08T16:59:26Z)
- Understanding Masked Autoencoders From a Local Contrastive Perspective [80.57196495601826]
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies.
We introduce a new empirical framework, called Local Contrastive MAE, to analyze both reconstructive and contrastive aspects of MAE.
arXiv Detail & Related papers (2023-10-03T12:08:15Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the heavy data requirements of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- R-MAE: Regions Meet Masked Autoencoders [113.73147144125385]
We explore regions as a potential visual analogue of words for self-supervised image representation learning.
Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions.
arXiv Detail & Related papers (2023-06-08T17:56:46Z)
- SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders [24.73294590182861]
Masked autoencoding (MAE) behaves differently in vision and language: unlike words in NLP, images lack a natural semantic decomposition.
We propose a Semantic-Guided Masking strategy to integrate semantic information into the training process of MAE.
arXiv Detail & Related papers (2022-06-21T09:08:32Z)
- An Empirical Investigation of Representation Learning for Imitation [76.48784376425911]
Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific data.
We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation.
arXiv Detail & Related papers (2022-05-16T11:23:42Z)
- Semantic Representation and Dependency Learning for Multi-Label Image Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel- and spatial-wise attention matrices to guide the model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.