Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN
- URL: http://arxiv.org/abs/2205.13943v4
- Date: Fri, 2 Jun 2023 10:21:16 GMT
- Title: Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN
- Authors: Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li
- Abstract summary: Masked image modeling, an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers.
We propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way.
- Score: 38.87225202482656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling, an emerging self-supervised pre-training method, has
shown impressive success across numerous downstream vision tasks with Vision
Transformers. Its underlying idea is simple: a portion of the input image is
masked out and then reconstructed via a pretext task. However, the working
principle behind MIM is not well explained, and previous studies have argued that
MIM primarily works for the Transformer family but is incompatible with CNNs.
In this work, we observe that MIM essentially teaches the model to learn better
middle-order interactions among patches for more generalized feature
extraction. We then propose an Architecture-Agnostic Masked Image Modeling
framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a
unified way. Extensive experiments on popular benchmarks show that A$^2$MIM
learns better representations without explicit design and endows the backbone
model with a stronger capability to transfer to various downstream tasks.
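The masking-and-reconstruction recipe described in the abstract is compact enough to sketch. Below is a minimal, assumption-laden illustration in PyTorch, not the authors' code: patches are masked directly in pixel space with a per-image mean-colour fill so the same corruption applies to either a CNN or a ViT backbone; `random_patch_mask`, `mim_step`, `backbone`, `decoder`, and all hyperparameters are hypothetical placeholders.
```python
# Minimal sketch of a generic masked-image-modeling step (NOT the authors'
# reference implementation). `backbone` and `decoder` are placeholder modules;
# the mean-colour fill, patch size, and mask ratio are illustrative assumptions.
import torch
import torch.nn.functional as F


def random_patch_mask(images: torch.Tensor, patch: int = 32, ratio: float = 0.6) -> torch.Tensor:
    """Return a {0,1} mask of shape (B, 1, H, W); 1 marks pixels to be masked."""
    B, _, H, W = images.shape
    gh, gw = H // patch, W // patch          # assumes H and W are divisible by `patch`
    scores = torch.rand(B, 1, gh, gw, device=images.device)
    k = max(1, int(ratio * gh * gw))         # number of patches to mask per image
    thresh = scores.flatten(1).kthvalue(k, dim=1).values.view(B, 1, 1, 1)
    coarse = (scores <= thresh).float()      # mask the k patches with the smallest scores
    return F.interpolate(coarse, size=(H, W), mode="nearest")


def mim_step(backbone, decoder, images: torch.Tensor) -> torch.Tensor:
    """One pre-training step: corrupt the image, encode, reconstruct, loss on masked pixels only."""
    mask = random_patch_mask(images)
    mean_rgb = images.mean(dim=(2, 3), keepdim=True)      # per-image mean colour fill
    corrupted = images * (1.0 - mask) + mean_rgb * mask   # mask in pixel space (architecture-agnostic)
    features = backbone(corrupted)                        # any dense feature extractor (CNN or ViT)
    recon = decoder(features)                             # maps features back to (B, 3, H, W)
    l1 = F.l1_loss(recon, images, reduction="none") * mask
    return l1.sum() / (mask.sum() * images.shape[1]).clamp(min=1.0)
```
The exact masking stage, fill value, and loss used by A$^2$MIM differ in detail; the sketch only illustrates the generic masked-prediction objective the abstract describes.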
Related papers
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks [60.22144823791902]
We unveil a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for vision tasks.
VisionLLaMA is a unified and generic modelling framework for solving most vision tasks.
arXiv Detail & Related papers (2024-03-01T13:30:51Z) - Morphing Tokens Draw Strong Masked Image Models [28.356863521946607]
Masked image modeling (MIM) has emerged as a promising approach for training Vision Transformers (ViTs).
We introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets.
DTM is compatible with various SSL frameworks; we showcase improved MIM results by employing DTM while introducing barely any extra training cost.
arXiv Detail & Related papers (2023-12-30T14:53:09Z) - PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z) - Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z) - Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to "learn by recovering missing content".
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z) - On Vision Features in Multimodal Machine Translation [34.41229863267296]
We develop a selective attention model to study the patch-level contribution of an image in multimodal machine translation.
Our results suggest the need of carefully examining MMT models, especially when current benchmarks are small-scale and biased.
arXiv Detail & Related papers (2022-03-17T08:51:09Z) - Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)