Morphing Tokens Draw Strong Masked Image Models
- URL: http://arxiv.org/abs/2401.00254v4
- Date: Fri, 21 Mar 2025 09:24:14 GMT
- Title: Morphing Tokens Draw Strong Masked Image Models
- Authors: Taekyung Kim, Byeongho Heo, Dongyoon Han
- Abstract summary: Masked image modeling (MIM) has emerged as a promising approach for pre-training Vision Transformers (ViTs). We introduce Dynamic Token Morphing (DTM), a novel method that dynamically aggregates tokens while preserving context to generate contextualized targets. DTM is compatible with various SSL frameworks; we showcase significantly improved MIM results, barely introducing extra training costs.
- Score: 28.356863521946607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling (MIM) has emerged as a promising approach for pre-training Vision Transformers (ViTs). MIMs predict masked tokens token-wise to recover target signals that are tokenized from images or generated by pre-trained models like vision-language models. While using tokenizers or pre-trained models is viable, they often offer spatially inconsistent supervision even for neighboring tokens, hindering models from learning discriminative representations. Our pilot study identifies spatial inconsistency in supervisory signals and suggests that addressing it can improve representation learning. Building upon this insight, we introduce Dynamic Token Morphing (DTM), a novel method that dynamically aggregates tokens while preserving context to generate contextualized targets, thereby likely reducing spatial inconsistency. DTM is compatible with various SSL frameworks; we showcase significantly improved MIM results, barely introducing extra training costs. Our method facilitates MIM training by using more spatially consistent targets, resulting in improved training trends as evidenced by lower losses. Experiments on ImageNet-1K and ADE20K demonstrate DTM's superiority, which surpasses complex state-of-the-art MIM methods. Furthermore, the evaluation of transfer learning on downstream tasks like iNaturalist, along with extensive empirical studies, supports DTM's effectiveness.
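The abstract does not spell out the morphing procedure, so the following is only a minimal PyTorch sketch of the general idea: averaging each teacher target token with its most similar peer so that masked-token supervision becomes more spatially consistent. The greedy pairing, the merge count, and the smooth-L1 loss are illustrative assumptions, not the paper's actual algorithm.

```python
import torch
import torch.nn.functional as F

def morph_targets(targets: torch.Tensor, num_merges: int = 49) -> torch.Tensor:
    """Average each selected target token with its most similar peer so that
    masked-token supervision becomes more spatially consistent.
    targets: [B, N, D] tokens from a frozen teacher (hypothetical choice).
    This greedy pairing is an illustrative simplification, not the paper's
    token-morphing algorithm."""
    x = F.normalize(targets, dim=-1)
    sim = x @ x.transpose(1, 2)                        # [B, N, N] cosine similarities
    sim.diagonal(dim1=1, dim2=2).fill_(float("-inf"))  # exclude self-matches
    best_sim, best_idx = sim.max(dim=-1)               # most similar peer per token
    morphed = targets.clone()
    keep = best_sim.topk(num_merges, dim=-1).indices   # merge only the best-matched tokens
    for b in range(targets.size(0)):
        src = keep[b]
        dst = best_idx[b, src]
        morphed[b, src] = 0.5 * (targets[b, src] + targets[b, dst])
    return morphed

def mim_loss(pred: torch.Tensor, targets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Smooth-L1 loss on masked positions against the morphed targets.
    pred, targets: [B, N, D]; mask: [B, N] boolean, True where a token was masked."""
    return F.smooth_l1_loss(pred[mask], morph_targets(targets)[mask])
```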
Related papers
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens [53.99177152562075]
Scaling up autoregressive models in vision has not proven as beneficial as in large language models.
We focus on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed order using BERT- or GPT-like transformer architectures.
Our results show that while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends.
arXiv Detail & Related papers (2024-10-17T17:59:59Z) - MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR achieves substantially better performance than other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z) - Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning [18.424840375721303]
Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images.
A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with high-level targets.
This study is among the first to thoroughly analyze and address the challenges of such a framework, which we refer to as Latent MIM.
arXiv Detail & Related papers (2024-07-22T17:54:41Z) - On the Role of Discrete Tokenization in Visual Representation Learning [35.10829554701771]
Masked image modeling (MIM) has gained popularity alongside contrastive learning methods.
Recent MIM approaches adopt discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored.
We provide a comprehensive theoretical understanding of how discrete tokenization affects the model's generalization capabilities.
We propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework.
arXiv Detail & Related papers (2024-07-12T08:25:31Z) - Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance its vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
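A rough sketch of the alignment step described above, assuming the student ViT and the frozen CLIP vision encoder produce tokens of matching dimension (e.g., after a projection head); the negative cosine-similarity loss is an illustrative choice, not necessarily UTA's exact objective.

```python
import torch
import torch.nn.functional as F

def unmasked_alignment_loss(student_tokens: torch.Tensor,
                            clip_tokens: torch.Tensor,
                            visible: torch.Tensor) -> torch.Tensor:
    """Align the student ViT's unmasked (visible) tokens with the matching
    tokens of a frozen CLIP vision encoder via negative cosine similarity.
    student_tokens, clip_tokens: [B, N, D]; visible: [B, N] boolean, True = not masked.
    Illustrative objective only, not UTA's reference formulation."""
    s = F.normalize(student_tokens[visible], dim=-1)
    t = F.normalize(clip_tokens[visible], dim=-1).detach()  # teacher stays frozen
    return 1.0 - (s * t).sum(dim=-1).mean()
```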
arXiv Detail & Related papers (2024-05-29T11:48:17Z) - Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z) - MIMIC: Mask Image Pre-training with Mix Contrastive Fine-tuning for Facial Expression Recognition [11.820043444385432]
We introduce a novel FER training paradigm named Mask Image pre-training with MIx Contrastive fine-tuning (MIMIC).
In the initial phase, we pre-train the ViT via masked image reconstruction on general images.
In the fine-tuning stage, we introduce a mix-supervised contrastive learning process, which enhances the model with a more extensive range of positive samples.
arXiv Detail & Related papers (2024-01-14T10:30:32Z) - Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
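A minimal sketch of that recipe using hmmlearn's GaussianHMM; the metric choice, run count, and number of hidden states are placeholders rather than the paper's settings.

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

# Each training run logs a few metrics per step; here the data is random
# placeholder content standing in for e.g. loss, gradient norm, accuracy.
runs = [np.random.randn(500, 3) for _ in range(8)]
X = np.concatenate(runs)              # hmmlearn expects stacked sequences
lengths = [len(r) for r in runs]      # plus the length of each sequence

model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=200)
model.fit(X, lengths)
states = model.predict(X, lengths)    # one latent training "phase" per step
# Inspecting model.transmat_ and where `states` changes value is one way to
# look for phase transitions or slow "detour" states.
```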
arXiv Detail & Related papers (2023-08-18T13:20:08Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the need of Vision Transformer networks for large fully-annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Beyond Pretrained Features: Noisy Image Modeling Provides Adversarial Defense [52.66971714830943]
Masked image modeling (MIM) has become a prevailing framework for self-supervised visual representation learning.
In this paper, we investigate how this powerful self-supervised learning paradigm can provide adversarial robustness to downstream classifiers.
We propose an adversarial defense method, referred to as De3, by exploiting the pretrained decoder for denoising.
arXiv Detail & Related papers (2023-02-02T12:37:24Z) - Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling [36.231030262831005]
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL).
We introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD).
arXiv Detail & Related papers (2022-12-31T15:50:02Z) - Improve Transformer Pre-Training with Decoupled Directional Relative Position Encoding and Representation Differentiations [23.2969212998404]
We revisit the Transformer-based pre-trained language models and identify two problems that may limit the expressiveness of the model.
Existing relative position encoding models conflate two heterogeneous types of information: relative distance and direction.
We propose two novel techniques to improve pre-trained language models.
arXiv Detail & Related papers (2022-10-09T12:35:04Z) - MimCo: Masked Image Modeling Pre-training with Contrastive Teacher [14.413674270588023]
Masked image modeling (MIM) has received much attention in self-supervised learning (SSL).
However, visualizations show that the learned representations are less separable, especially compared to those based on contrastive learning pre-training.
We propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training.
arXiv Detail & Related papers (2022-09-07T10:59:05Z) - Masked Image Modeling with Denoising Contrast [30.31920660487222]
Masked image modeling dominates this line of self-supervised representation learning with state-of-the-art performance on vision Transformers.
We introduce a new pre-training method, ConMIM, to produce simple intra-image inter-patch contrastive constraints.
ConMIM-pretrained vision Transformers with various scales achieve promising results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks.
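For illustration, a simplified intra-image inter-patch contrastive term in PyTorch, where each predicted patch is pulled toward its own target patch and pushed away from the image's other patches; this stands in for, but is not, ConMIM's actual denoising contrastive objective.

```python
import torch
import torch.nn.functional as F

def intra_image_patch_contrast(pred: torch.Tensor,
                               target: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style constraint inside one image: each predicted patch should
    be closest to its own target patch, with the image's other patches acting
    as negatives. pred, target: [N, D] patch features for a single image.
    A simplified stand-in, not ConMIM's reference implementation."""
    p = F.normalize(pred, dim=-1)
    t = F.normalize(target, dim=-1).detach()
    logits = p @ t.t() / temperature                   # [N, N] similarity matrix
    labels = torch.arange(p.size(0), device=p.device)  # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```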
arXiv Detail & Related papers (2022-05-19T15:22:29Z) - Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to 'learn by recovering missing contents'.
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z) - Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
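A toy sketch of the two corruption styles being compared: masking versus noising that keeps the input looking like a full sentence via local reordering and word replacement. The rates and window size below are placeholders, not the paper's settings.

```python
import random

def mask_noise(tokens, mask_token="<mask>", p=0.35):
    """MLM-style corruption: hide a fraction of the words behind a mask symbol."""
    return [mask_token if random.random() < p else t for t in tokens]

def reorder_replace_noise(tokens, vocab, p_replace=0.15, max_shift=3.0):
    """Corruption that keeps the input looking like a full sentence: replace a
    few words with other vocabulary items, then shuffle words locally by
    perturbing their positions within a small window and re-sorting."""
    noisy = [random.choice(vocab) if random.random() < p_replace else t for t in tokens]
    keys = [i + random.uniform(0, max_shift) for i in range(len(noisy))]
    return [t for _, t in sorted(zip(keys, noisy))]
```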
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.