Morphing Tokens Draw Strong Masked Image Models
- URL: http://arxiv.org/abs/2401.00254v3
- Date: Thu, 10 Oct 2024 16:07:42 GMT
- Title: Morphing Tokens Draw Strong Masked Image Models
- Authors: Taekyung Kim, Byeongho Heo, Dongyoon Han,
- Abstract summary: Masked image modeling (MIM) has emerged as a promising approach for training Vision Transformers (ViTs)
We introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets.
DTM is compatible with various SSL frameworks; we showcase improved MIM results by employing DTM, barely introducing extra training costs.
- Score: 28.356863521946607
- License:
- Abstract: Masked image modeling (MIM) has emerged as a promising approach for training Vision Transformers (ViTs). The essence of MIM lies in the token-wise prediction of masked tokens, which aims to predict targets tokenized from images or generated by pre-trained models like vision-language models. While using tokenizers or pre-trained models are plausible MIM targets, they often offer spatially inconsistent targets even for neighboring tokens, complicating models to learn unified and discriminative representations. Our pilot study identifies spatial inconsistencies and suggests that resolving them can accelerate representation learning. Building upon this insight, we introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets, thereby mitigating spatial inconsistency. DTM is compatible with various SSL frameworks; we showcase improved MIM results by employing DTM, barely introducing extra training costs. Our method facilitates training by using consistent targets, resulting in 1) faster training and 2) reduced losses. Experiments on ImageNet-1K and ADE20K demonstrate the superiority of our method compared with state-of-the-art, complex MIM methods. Furthermore, the comparative evaluation of the iNaturalists and fine-grained visual classification datasets further validates the transferability of our method on various downstream tasks. Code is available at https://github.com/naver-ai/dtm
Related papers
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens [53.99177152562075]
Scaling up autoregressive models in vision has not proven as beneficial as in large language models.
We focus on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed order using BERT- or GPT-like transformer architectures.
Our results show that while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends.
arXiv Detail & Related papers (2024-10-17T17:59:59Z) - Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance its vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
arXiv Detail & Related papers (2024-05-29T11:48:17Z) - Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language
Pre-training [87.69394953339238]
Masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
We propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.
arXiv Detail & Related papers (2024-03-01T03:25:58Z) - Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
arXiv Detail & Related papers (2023-08-18T13:20:08Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Disjoint Masking with Joint Distillation for Efficient Masked Image
Modeling [36.231030262831005]
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL)
We introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD)
arXiv Detail & Related papers (2022-12-31T15:50:02Z) - Improve Transformer Pre-Training with Decoupled Directional Relative
Position Encoding and Representation Differentiations [23.2969212998404]
We revisit the Transformer-based pre-trained language models and identify two problems that may limit the expressiveness of the model.
Existing relative position encoding models confuse two heterogeneous information: relative distance and direction.
We propose two novel techniques to improve pre-trained language models.
arXiv Detail & Related papers (2022-10-09T12:35:04Z) - MimCo: Masked Image Modeling Pre-training with Contrastive Teacher [14.413674270588023]
Masked image modeling (MIM) has received much attention in self-supervised learning (SSL)
visualizations show that the learned representations are less separable, especially compared to those based on contrastive learning pre-training.
We propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training.
arXiv Detail & Related papers (2022-09-07T10:59:05Z) - Beyond Masking: Demystifying Token-Based Pre-Training for Vision
Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to learn by recovering missing contents'
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z) - Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT)
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.