DeepMIM: Deep Supervision for Masked Image Modeling
- URL: http://arxiv.org/abs/2303.08817v2
- Date: Thu, 16 Mar 2023 05:05:46 GMT
- Title: DeepMIM: Deep Supervision for Masked Image Modeling
- Authors: Sucheng Ren, Fangyun Wei, Samuel Albanie, Zheng Zhang, Han Hu
- Abstract summary: Deep supervision was widely used in image classification in the early deep learning era.
With the emergence of normalization techniques and residual connections, deep supervision in image classification was gradually phased out.
We revisit deep supervision for masked image modeling (MIM), which pre-trains a Vision Transformer (ViT) via a mask-and-predict scheme.
- Score: 46.01916629713594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep supervision, which applies extra supervision to the intermediate
features of a neural network, was widely used in image classification in the
early deep learning era, since it significantly reduces training difficulty and
eases optimization, e.g., by mitigating vanishing gradients relative to vanilla
training. Nevertheless, with the emergence of normalization techniques and
residual connections, deep supervision in image classification was gradually
phased out. In this paper, we revisit deep supervision for masked image
modeling (MIM), which pre-trains a Vision Transformer (ViT) via a
mask-and-predict scheme. Experimentally, we find that deep supervision drives
the shallower layers to learn more meaningful representations, accelerates
model convergence, and expands attention diversity. Our approach, called
DeepMIM, significantly boosts the representation capability of each layer. In
addition, DeepMIM is compatible with many MIM models across a range of
reconstruction targets. For instance, using ViT-B, DeepMIM on MAE achieves 84.2
top-1 accuracy on ImageNet, outperforming MAE by +0.6. By combining DeepMIM
with a stronger tokenizer, CLIP, our model achieves state-of-the-art
performance on various downstream tasks, including image classification (85.6
top-1 accuracy on ImageNet-1K, outperforming MAE-CLIP by +0.8), object
detection (52.8 box AP on COCO) and semantic segmentation (53.1 mIoU on
ADE20K). Code and models are available at
https://github.com/OliverRensu/DeepMIM.
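To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of deep supervision for MIM: selected intermediate encoder blocks each feed a lightweight reconstruction head, and the masked-patch losses are averaged. The encoder interface (`return_all_blocks=True`) and all class and argument names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of deep supervision for MIM (illustrative, not the authors' code).
# Assumption: the encoder can return features from every block, and each
# supervised block gets its own lightweight reconstruction head.
import torch
import torch.nn as nn

class DeeplySupervisedMIM(nn.Module):
    def __init__(self, encoder, embed_dim, patch_dim, supervised_layers=(3, 6, 9, 12)):
        super().__init__()
        self.encoder = encoder  # a ViT that returns features per block (assumed interface)
        self.supervised_layers = supervised_layers
        # one reconstruction head per supervised layer
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, patch_dim) for _ in supervised_layers
        )

    def forward(self, masked_patches, target, mask):
        # features: list of [B, N, D] token features, one entry per encoder block
        features = self.encoder(masked_patches, return_all_blocks=True)
        loss = 0.0
        for head, layer_idx in zip(self.heads, self.supervised_layers):
            pred = head(features[layer_idx - 1])          # predict the reconstruction target
            per_patch = (pred - target).pow(2).mean(-1)   # MSE per patch, [B, N]
            loss = loss + (per_patch * mask).sum() / mask.sum()  # masked patches only
        return loss / len(self.supervised_layers)
```

The key design point suggested by the abstract is that the extra heads supervise shallower layers directly, rather than relying solely on a loss at the final block.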
Related papers
- Adapting LLaMA Decoder to Vision Transformer [65.47663195233802]
This work examines whether decoder-only Transformers such as LLaMA can be adapted to the computer vision field.
We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention leads to an attention collapse issue.
We develop a soft mask strategy that gradually introduces the causal mask to the self-attention at the onset of training to ease optimization.
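A minimal sketch of such a soft mask schedule: attention starts fully bidirectional and is annealed toward a strictly causal mask early in training. The linear blending and warmup length are assumptions, not the paper's exact recipe.

```python
# Hypothetical soft causal-mask schedule (blending scheme is an assumption).
import torch

def soft_causal_mask(num_tokens, step, warmup_steps):
    alpha = min(step / warmup_steps, 1.0)        # 0 -> bidirectional, 1 -> causal
    causal = torch.tril(torch.ones(num_tokens, num_tokens))
    # interpolate between all-ones (no masking) and the lower-triangular mask,
    # then convert to an additive attention bias via the log of the soft mask
    soft = (1 - alpha) * torch.ones(num_tokens, num_tokens) + alpha * causal
    return torch.log(soft.clamp(min=1e-9))       # add to attention logits

# usage: logits = q @ k.transpose(-2, -1) / d**0.5 + soft_causal_mask(N, step, 10_000)
```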
arXiv Detail & Related papers (2024-04-10T06:30:08Z)
- Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders [17.564722905991776]
We present an Image-to-Vector (Img2Vec) pipeline for masked image modeling (MIM) with deep features.
Img2Vec is a simple yet effective framework tailored to deep feature MIM learning, accomplishing superb comprehensive performance on representative vision tasks.
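A hypothetical sketch of building a deep-feature MIM target from a frozen teacher; the layer choice, averaging, and normalization below are illustrative assumptions rather than Img2Vec's exact design.

```python
# Illustrative deep-feature target for MIM from a frozen teacher network.
import torch

@torch.no_grad()
def deep_feature_target(teacher, images, layers=(8, 10, 12)):
    feats = teacher(images, return_all_blocks=True)   # list of [B, N, D] features (assumed API)
    # average a few deep layers to form a token-diverse regression target
    target = torch.stack([feats[i - 1] for i in layers]).mean(0)
    # per-token normalization stabilizes the regression target
    return torch.nn.functional.layer_norm(target, target.shape[-1:])
```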
arXiv Detail & Related papers (2023-04-25T03:01:37Z)
- Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget [10.290956481715387]
Masked Autoencoder Contrastive Tuning (MAE-CT) is a sequential approach that tunes the rich features of a pre-trained masked autoencoder so that they form semantic clusters of objects without using any labels.
MAE-CT does not rely on hand-crafted augmentations and frequently achieves its best performance using only minimal augmentations (crop & flip).
MAE-CT surpasses previous self-supervised methods trained on ImageNet in linear probing, k-NN and low-shot classification accuracy, as well as in unsupervised clustering accuracy.
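For intuition, a generic InfoNCE contrastive loss over encoder features is sketched below; MAE-CT itself uses a nearest-neighbor contrastive variant, so treat this as an assumption-laden approximation of contrastive tuning, not the paper's objective.

```python
# Generic InfoNCE loss for contrastive tuning of a pre-trained encoder (approximation).
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    # z1, z2: [B, D] features of two augmented views of the same images
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature           # [B, B] similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)       # positives lie on the diagonal
```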
arXiv Detail & Related papers (2023-04-20T17:51:09Z)
- TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications benefit only marginally, if at all, from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
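TinyMIM studies several distillation targets; the sketch below shows the simplest option, token-feature regression from a large MIM-pre-trained teacher to a small student, with the linear projection assumed in order to match feature widths. It is a sketch of one such technique, not the paper's full recipe.

```python
# Sketch of distilling a large MIM-pre-trained teacher into a small student
# via token-feature regression (projection head is an assumption).
import torch
import torch.nn.functional as F

def distill_step(student, teacher, proj, images):
    with torch.no_grad():
        t_feat = teacher(images)                 # [B, N, D_t] teacher token features
    s_feat = proj(student(images))               # project [B, N, D_s] -> [B, N, D_t]
    return F.smooth_l1_loss(s_feat, t_feat)      # regress teacher token features
```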
arXiv Detail & Related papers (2023-01-03T18:59:54Z)
- CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying reconstruction supervision on the CLIP representation has proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
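A hypothetical sketch of these two knobs, supervision position and mask ratio; the function names, cosine objective, and flow are assumptions for exposition, not CAE v2's actual code.

```python
# Illustrative CLIP-target loss with a configurable supervision position,
# plus the mask-ratio knob that controls how many patches are dropped.
import torch
import torch.nn.functional as F

def clip_target_loss(pred, clip_feat, mask, supervise_masked_only=True):
    # pred, clip_feat: [B, N, D]; mask: [B, N] with 1 at masked positions
    per_token = 1 - F.cosine_similarity(pred, clip_feat, dim=-1)  # [B, N]
    weights = mask if supervise_masked_only else torch.ones_like(mask)
    return (per_token * weights).sum() / weights.sum()

# mask ratio = fraction of patches dropped before encoding, e.g.:
# mask = (torch.rand(B, N) < mask_ratio).float()
```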
arXiv Detail & Related papers (2022-11-17T18:58:33Z)
- A Unified View of Masked Image Modeling [117.79456335844439]
Masked image modeling has demonstrated great potential to alleviate the label-hungry problem of training large-scale vision Transformers.
We introduce a simple yet effective method, termed as MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods.
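A minimal sketch of the objective as summarized above: regress the teacher's layer-normalized features, but only at masked positions (interface details are assumed).

```python
# Sketch of a MaskDistill-style loss: normalized teacher features, masked positions only.
import torch.nn.functional as F

def maskdistill_loss(student_pred, teacher_feat, mask):
    # normalize teacher features per token before regression
    target = F.layer_norm(teacher_feat, teacher_feat.shape[-1:])
    per_token = (student_pred - target).pow(2).mean(-1)     # [B, N]
    return (per_token * mask).sum() / mask.sum()            # masked positions only
```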
arXiv Detail & Related papers (2022-10-19T14:59:18Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
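For intuition, the vector-quantization step behind such a tokenizer can be sketched as a nearest-codeword lookup; the codebook handling below is an assumption, not BEiT v2's full tokenizer.

```python
# Sketch of vector quantization for a visual tokenizer: each patch feature is
# assigned the index of its nearest codebook entry, and those discrete ids
# serve as prediction targets for masked patches.
import torch

def quantize(features, codebook):
    # features: [B, N, D]; codebook: [K, D]
    dists = torch.cdist(features, codebook.unsqueeze(0).expand(features.size(0), -1, -1))
    return dists.argmin(-1)   # [B, N] discrete visual token ids
```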
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves over state-of-the-art self-supervised learning (SSL) methods across a variety of tasks and datasets.
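A hypothetical sketch of the adversarial objective: the masking network is updated to maximize the encoder's SSL loss while the encoder minimizes it. Optimizer handling and the mask parameterization are assumptions, not ADIOS's exact training loop.

```python
# Sketch of alternating min-max updates for adversarial masking (details assumed).
def adversarial_step(encoder, masker, enc_opt, mask_opt, images, ssl_loss):
    # encoder step: minimize the SSL loss under the current masks
    mask = masker(images).detach()
    enc_opt.zero_grad()
    loss = ssl_loss(encoder, images, mask)
    loss.backward()
    enc_opt.step()

    # masker step: maximize the same loss (gradient ascent via negation)
    mask_opt.zero_grad()
    adv_loss = -ssl_loss(encoder, images, masker(images))
    adv_loss.backward()
    mask_opt.step()
```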
arXiv Detail & Related papers (2022-01-31T10:23:23Z)