SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation
- URL: http://arxiv.org/abs/2406.10673v1
- Date: Sat, 15 Jun 2024 15:39:32 GMT
- Title: SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation
- Authors: Yike Yuan, Huanzhang Dou, Fengjun Guo, Xi Li,
- Abstract summary: SemanticMIM is a framework to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation.
We conduct a thorough comparative analysis between CL and MIM, revealing that their complementary advantages stem from two distinct phases, i.e., compression and reconstruction.
We demonstrate that SemanticMIM effectively amalgamates the benefits of CL and MIM, leading to significant enhancement of performance and feature linear separability.
- Score: 13.013776924941205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper represents a neat yet effective framework, named SemanticMIM, to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation. We conduct a thorough comparative analysis between CL and MIM, revealing that their complementary advantages fundamentally stem from two distinct phases, i.e., compression and reconstruction. Specifically, SemanticMIM leverages a proxy architecture that customizes interaction between image and mask tokens, bridging these two phases to achieve general visual representation with the property of abundant semantic and positional awareness. Through extensive qualitative and quantitative evaluations, we demonstrate that SemanticMIM effectively amalgamates the benefits of CL and MIM, leading to significant enhancement of performance and feature linear separability. SemanticMIM also offers notable interpretability through attention response visualization. Codes are available at https://github.com/yyk-wew/SemanticMIM.
Related papers
- X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM.
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z) - On the Role of Discrete Tokenization in Visual Representation Learning [35.10829554701771]
Masked image modeling (MIM) has gained popularity alongside contrastive learning methods.
discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored.
We provide a comprehensive theoretical understanding on how discrete tokenization affects the model's generalization capabilities.
We propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework.
arXiv Detail & Related papers (2024-07-12T08:25:31Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language
Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z) - CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z) - i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? [26.146459754995597]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
arXiv Detail & Related papers (2022-10-20T17:59:54Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP)
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of
Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, ie, semantic segmentation and depth estimation, and present a novel attention module called Cross-Channel Attention Module (CCAM)
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth called AffineMix, and a simple depth augmentation using predicted semantics called ColorAug.
Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantic
arXiv Detail & Related papers (2022-06-21T17:40:55Z) - SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders [24.73294590182861]
Masked autoencoding (MAE) is different between vision and language.
Unlike words in NLP, the lack of semantic decomposition of images still makes MAE different between vision and language.
We propose a Semantic-Guided Masking strategy to integrate semantic information into the training process of MAE.
arXiv Detail & Related papers (2022-06-21T09:08:32Z) - Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.