Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
- URL: http://arxiv.org/abs/2412.03215v1
- Date: Wed, 04 Dec 2024 11:08:32 GMT
- Title: Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
- Authors: Marcin Przewięźlikowski, Randall Balestriero, Wojciech Jasiński, Marek Śmieja, Bartosz Zieliński,
- Abstract summary: Masked Image Modeling (MIM) has emerged as a popular method for Self-Supervised Learning (SSL) of visual representations.
For high-level perception tasks, MIM-pretrained models offer lower out-of-the-box representation quality than the Joint-Embedding Architectures (JEA)
We reveal that whereas JEAs construct their representation on a selected set of relevant image fragments, MIM models aggregate nearly whole image content.
- Score: 10.800240155402417
- License:
- Abstract: Masked Image Modeling (MIM) has emerged as a popular method for Self-Supervised Learning (SSL) of visual representations. However, for high-level perception tasks, MIM-pretrained models offer lower out-of-the-box representation quality than the Joint-Embedding Architectures (JEA) - another prominent SSL paradigm. To understand this performance gap, we analyze the information flow in Vision Transformers (ViT) learned by both approaches. We reveal that whereas JEAs construct their representation on a selected set of relevant image fragments, MIM models aggregate nearly whole image content. Moreover, we demonstrate that MIM-trained ViTs retain valuable information within their patch tokens, which is not effectively captured by the global [cls] token representations. Therefore, selective aggregation of relevant patch tokens, without any fine-tuning, results in consistently higher-quality of MIM representations. To our knowledge, we are the first to highlight the lack of effective representation aggregation as an emergent issue of MIM and propose directions to address it, contributing to future advances in Self-Supervised Learning.
Related papers
- What's in the Image? A Deep-Dive into the Vision of Vision Language Models [20.669971132114195]
Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content.
In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers.
We reveal several key insights about how these models process visual data.
arXiv Detail & Related papers (2024-11-26T14:59:06Z) - Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM)
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z) - Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning [18.424840375721303]
Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images.
A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with the high-level targets.
This study is among the first to thoroughly analyze and address the challenges of such framework, which we refer to as Latent MIM.
arXiv Detail & Related papers (2024-07-22T17:54:41Z) - On the Role of Discrete Tokenization in Visual Representation Learning [35.10829554701771]
Masked image modeling (MIM) has gained popularity alongside contrastive learning methods.
discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored.
We provide a comprehensive theoretical understanding on how discrete tokenization affects the model's generalization capabilities.
We propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework.
arXiv Detail & Related papers (2024-07-12T08:25:31Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language
Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch
and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We propose a learning-based vision-language pre-training approach, such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - Beyond Masking: Demystifying Token-Based Pre-Training for Vision
Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to learn by recovering missing contents'
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z) - Vision Matters When It Should: Sanity Checking Multimodal Machine
Translation Models [25.920891392933058]
Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available.
Recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise.
arXiv Detail & Related papers (2021-09-08T03:32:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.