FastMIM: Expediting Masked Image Modeling Pre-training for Vision
- URL: http://arxiv.org/abs/2212.06593v1
- Date: Tue, 13 Dec 2022 14:09:32 GMT
- Title: FastMIM: Expediting Masked Image Modeling Pre-training for Vision
- Authors: Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Yunhe Wang and Chang Xu
- Abstract summary: FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images.
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
- Score: 65.47756720190155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The combination of transformers and masked image modeling (MIM) pre-training
framework has shown great potential in various vision tasks. However, the
pre-training computational budget is too heavy and prevents MIM from
becoming a practical training paradigm. This paper presents FastMIM, a simple
and generic framework for expediting masked image modeling with the following
two steps: (i) pre-training vision backbones with low-resolution input images;
and (ii) reconstructing Histograms of Oriented Gradients (HOG) feature instead
of original RGB values of the input images. In addition, we propose FastMIM-P
to progressively enlarge the input resolution during pre-training stage to
further enhance the transfer results of models with high capacity. We point out
that: (i) a wide range of input resolutions in the pre-training phase can lead to
similar performance in the fine-tuning phase and downstream tasks such as
detection and segmentation; (ii) the shallow layers of the encoder are more
important during pre-training, and discarding the last several layers can speed up
the training stage with no harm to fine-tuning performance; (iii) the decoder
should match the size of the selected network; and (iv) HOG is more stable than
RGB values when the resolution transfers. Equipped with FastMIM, all kinds of vision
backbones can be pre-trained in an efficient way. For example, we can achieve
83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
Compared to previous relevant approaches, we can achieve comparable or better
top-1 accuracy while accelerating the training procedure by $\sim$5$\times$. Code
can be found at https://github.com/ggjy/FastMIM.pytorch.
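To make the two steps concrete, below is a minimal PyTorch sketch of a FastMIM-style pre-training step. It is an illustration under stated assumptions, not the authors' released code: the `encoder` and `head` modules, their mask-aware call signature, and the exact HOG settings are placeholders (the repository above contains the real implementation).
```python
# Hedged sketch of one FastMIM-style step: (i) low-resolution inputs,
# (ii) HOG features of masked patches as the regression target.
# `encoder` (mask-aware ViT-style backbone) and `head` are assumed modules.
import torch
import torch.nn.functional as F
from skimage.feature import hog

PATCH = 16          # ViT patch size; one HOG cell per patch
LOW_RES = 128       # pre-train at reduced resolution, e.g. 128x128
MASK_RATIO = 0.75   # fraction of patches hidden from the encoder

def hog_targets(images):
    """Per-patch HOG descriptors as regression targets (CPU, for clarity)."""
    targets = []
    for img in images:                       # img: (3, H, W) in [0, 1]
        gray = img.mean(0).numpy()           # HOG on a grayscale view
        feat = hog(gray, orientations=9,
                   pixels_per_cell=(PATCH, PATCH),
                   cells_per_block=(1, 1), feature_vector=False)
        targets.append(torch.from_numpy(feat).flatten(0, 1).flatten(1))
    return torch.stack(targets).float()      # (B, num_patches, hog_dim)

def fastmim_step(encoder, head, images):
    """One pre-training step: mask patches, regress HOG of the masked ones."""
    x = F.interpolate(images, size=LOW_RES, mode='bicubic')  # step (i)
    tgt = hog_targets(x)                                     # step (ii)
    mask = torch.rand(tgt.shape[:2]) < MASK_RATIO            # random patch mask
    pred = head(encoder(x, mask))                            # (B, n, hog_dim)
    return F.mse_loss(pred[mask], tgt[mask])                 # masked-patch loss
```
Under the same assumptions, FastMIM-P would simply schedule `LOW_RES` upward (e.g. 96 → 128 → 160) over the course of pre-training.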
Related papers
- Enhancing pretraining efficiency for medical image segmentation via transferability metrics [0.0]
In medical image segmentation tasks, the scarcity of labeled training data poses a significant challenge.
We introduce a novel transferability metric, based on contrastive learning, that measures how robustly a pretrained model is able to represent the target data.
arXiv Detail & Related papers (2024-10-24T12:11:52Z)
- MULLER: Multilayer Laplacian Resizer for Vision [16.67232499096539]
We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost.
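As a rough illustration of this family of resizers (not MULLER's exact formulation, which the paper defines precisely), such a module can be built as a standard resize plus a few learnable band-pass detail gains, which is why only a handful of parameters need training:
```python
# Illustrative Laplacian-style resizer: standard resize plus learnable,
# scaled band-pass detail; one scalar gain per band is the only parameter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianResizer(nn.Module):
    def __init__(self, out_size, bands=2):
        super().__init__()
        self.out_size = out_size
        self.gains = nn.Parameter(torch.zeros(bands))  # a handful of parameters

    def _blur(self, x, k):
        # simple box blur as a stand-in for a Gaussian
        return F.avg_pool2d(x, kernel_size=2 * k + 1, stride=1, padding=k)

    def forward(self, x):
        base = F.interpolate(x, size=self.out_size, mode='bilinear',
                             align_corners=False)
        out = base
        for i, g in enumerate(self.gains):
            band = self._blur(base, i + 1) - self._blur(base, i + 2)  # band-pass
            out = out + g * band               # learnable detail boost per band
        return out
```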
arXiv Detail & Related papers (2023-04-06T04:39:21Z)
- CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed as CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
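A hedged sketch of what such a two-stage curriculum can look like; the modules, the 60% mask ratio, and the switch epoch are illustrative placeholders rather than CoMAE's actual recipe:
```python
# Curriculum sketch: cross-modal contrastive alignment first,
# masked image modeling on the same encoder afterwards.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    """Standard InfoNCE loss between paired embeddings of shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (B, B) cross-modal similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def curriculum_loss(enc, head, rgb, depth, tokens, epoch, switch_epoch=100):
    """`tokens`: patch embeddings (B, N, D); `enc`/`head` are placeholders."""
    if epoch < switch_epoch:                   # stage 1: contrastive alignment
        return info_nce(enc(rgb), enc(depth))
    mask = torch.rand(tokens.shape[:2]) < 0.6  # stage 2: mask 60% of patches
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = head(corrupted)                     # reconstruct masked embeddings
    return F.mse_loss(pred[mask], tokens[mask])
```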
arXiv Detail & Related papers (2023-02-13T07:09:45Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require large amounts of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- Stable Optimization for Large Vision Model Based Deep Image Prior in Cone-Beam CT Reconstruction [6.558735319783205]
Large Vision Model (LVM) has recently demonstrated great potential for medical imaging tasks.
Deep Image Prior (DIP) effectively guides an untrained neural network to generate high-quality CBCT images without any training data.
We propose a stable optimization method for the forward-model-free DIP model for sparse-view CBCT.
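For background, the classic Deep Image Prior loop that this line of work builds on looks roughly as follows (a generic sketch, not the proposed forward-model-free CBCT variant):
```python
# Generic Deep Image Prior: fit an untrained CNN to a degraded observation;
# early stopping of the optimization acts as the image prior.
import torch
import torch.nn.functional as F

def dip_reconstruct(net, observed, in_ch=32, steps=2000, lr=1e-3):
    """`net`: any untrained image-to-image CNN; `observed`: (1, C, H, W)."""
    z = torch.randn(1, in_ch, *observed.shape[-2:])  # fixed random input code
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(net(z), observed)          # match the observation
        loss.backward()
        opt.step()
    return net(z).detach()                           # regularized reconstruction
```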
arXiv Detail & Related papers (2022-03-23T15:16:29Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
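A hedged sketch of the corrupt-then-predict pattern described above, shown here with a replaced-patch-detection objective; the module interfaces are assumptions for illustration:
```python
# CIM-style step (illustrative): a generator fills masked patches with
# plausible content; the enhancer detects which patches were replaced.
import torch
import torch.nn.functional as F

def cim_step(generator, enhancer, patches, mask):
    """patches: (B, N, D); mask: (B, N) bool, True where the generator writes."""
    with torch.no_grad():
        fakes = generator(patches, mask)             # plausible replacements
    corrupted = torch.where(mask.unsqueeze(-1), fakes, patches)
    logits = enhancer(corrupted).squeeze(-1)         # (B, N): replaced or not?
    return F.binary_cross_entropy_with_logits(logits, mask.float())
```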
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels, based on two core designs: an asymmetric encoder-decoder architecture and a high (e.g., 75%) masking ratio.
Coupling these two designs enables us to train large models efficiently and effectively.
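The random masking at the heart of MAE can be sketched in a few lines (simplified; the official implementation also tracks indices for un-shuffling in the decoder):
```python
# MAE-style random masking: keep 25% of patch tokens, drop the rest before
# the (heavy) encoder; a light decoder later reconstructs the missing pixels.
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D) token embeddings. Returns visible tokens + kept ids."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    ids = torch.rand(B, N).argsort(dim=1)            # random patch permutation
    keep = ids[:, :n_keep]                           # indices of visible patches
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep                             # encoder sees only `visible`
```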
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Memory Efficient Meta-Learning with Large Images [62.70515410249566]
Meta-learning approaches to few-shot classification are computationally efficient at test time, requiring just a few optimization steps or a single forward pass to learn a new task, but they are memory-intensive to train.
This limitation arises because a task's entire support set, which can contain up to 1000 images, must be processed before an optimization step can be taken.
We propose LITE, a general and memory efficient episodic training scheme that enables meta-training on large tasks composed of large images on a single GPU.
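A hedged sketch of the LITE idea as summarized above: the whole support set passes forward, but gradients flow through only a small random subset (`embed` and the budget are placeholders):
```python
# LITE-style episodic step: embed all support images, back-propagate through
# only `grad_budget` of them, keeping memory roughly constant in task size.
import torch

def lite_support_features(embed, support, grad_budget=32):
    """support: (S, C, H, W) with S possibly ~1000. Assumes S > grad_budget."""
    idx = torch.randperm(support.size(0))
    live = embed(support[idx[:grad_budget]])         # gradients kept
    with torch.no_grad():
        frozen = embed(support[idx[grad_budget:]])   # forward pass only
    feats = torch.cat([live, frozen], dim=0)         # full support representation
    return feats, idx                                # features follow order `idx`
```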
arXiv Detail & Related papers (2021-07-02T14:37:13Z)