FastMIM: Expediting Masked Image Modeling Pre-training for Vision
- URL: http://arxiv.org/abs/2212.06593v1
- Date: Tue, 13 Dec 2022 14:09:32 GMT
- Title: FastMIM: Expediting Masked Image Modeling Pre-training for Vision
- Authors: Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Yunhe Wang and Chang Xu
- Abstract summary: FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images.
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
- Score: 65.47756720190155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The combination of transformers and masked image modeling (MIM) pre-training
framework has shown great potential in various vision tasks. However, the
pre-training computational budget is too heavy and prevents MIM from
becoming a practical training paradigm. This paper presents FastMIM, a simple
and generic framework for expediting masked image modeling with the following
two steps: (i) pre-training vision backbones with low-resolution input images;
and (ii) reconstructing Histograms of Oriented Gradients (HOG) feature instead
of original RGB values of the input images. In addition, we propose FastMIM-P
to progressively enlarge the input resolution during pre-training stage to
further enhance the transfer results of models with high capacity. We point out
that: (i) a wide range of input resolutions in the pre-training phase can lead to
similar performance in the fine-tuning phase and downstream tasks such as
detection and segmentation; (ii) the shallow layers of the encoder are more
important during pre-training, and discarding the last several layers can speed up
the training stage with no harm to fine-tuning performance; (iii) the decoder
should match the size of the selected network; and (iv) HOG is more stable than
RGB values when the resolution transfers. Equipped with FastMIM, all kinds of vision
backbones can be pre-trained in an efficient way. For example, we can achieve
83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
Compared to previous relevant approaches, we can achieve comparable or better
top-1 accuracy while accelerating the training procedure by $\sim$5$\times$. Code
can be found at https://github.com/ggjy/FastMIM.pytorch.
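To make the two steps concrete, below is a minimal PyTorch sketch of a FastMIM-style pre-training step. It is an illustration under stated assumptions, not the authors' released code: the `encoder` and `head` modules, their mask-aware call signature, and the exact HOG settings are placeholders (the repository above contains the real implementation).
```python
# Hedged sketch of one FastMIM-style step: (i) low-resolution inputs,
# (ii) HOG features of masked patches as the regression target.
# `encoder` (mask-aware ViT-style backbone) and `head` are assumed modules.
import torch
import torch.nn.functional as F
from skimage.feature import hog

PATCH = 16          # ViT patch size; one HOG cell per patch
LOW_RES = 128       # pre-train at reduced resolution, e.g. 128x128
MASK_RATIO = 0.75   # fraction of patches hidden from the encoder

def hog_targets(images):
    """Per-patch HOG descriptors as regression targets (CPU, for clarity)."""
    targets = []
    for img in images:                       # img: (3, H, W) in [0, 1]
        gray = img.mean(0).numpy()           # HOG on a grayscale view
        feat = hog(gray, orientations=9,
                   pixels_per_cell=(PATCH, PATCH),
                   cells_per_block=(1, 1), feature_vector=False)
        targets.append(torch.from_numpy(feat).flatten(0, 1).flatten(1))
    return torch.stack(targets).float()      # (B, num_patches, hog_dim)

def fastmim_step(encoder, head, images):
    """One pre-training step: mask patches, regress HOG of the masked ones."""
    x = F.interpolate(images, size=LOW_RES, mode='bicubic')  # step (i)
    tgt = hog_targets(x)                                     # step (ii)
    mask = torch.rand(tgt.shape[:2]) < MASK_RATIO            # random patch mask
    pred = head(encoder(x, mask))                            # (B, n, hog_dim)
    return F.mse_loss(pred[mask], tgt[mask])                 # masked-patch loss
```
Under the same assumptions, FastMIM-P would simply schedule `LOW_RES` upward (e.g. 96 → 128 → 160) over the course of pre-training.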
Related papers
- Enhancing pretraining efficiency for medical image segmentation via transferability metrics [0.0]
In medical image segmentation tasks, the scarcity of labeled training data poses a significant challenge.
We introduce a novel transferability metric, based on contrastive learning, that measures how robustly a pretrained model is able to represent the target data.
arXiv Detail & Related papers (2024-10-24T12:11:52Z)
- MULLER: Multilayer Laplacian Resizer for Vision [16.67232499096539]
We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost.
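As a rough illustration of this family of resizers (not MULLER's exact formulation, which the paper defines precisely), such a module can be built as a standard resize plus a few learnable band-pass detail gains, which is why only a handful of parameters need training:
```python
# Illustrative Laplacian-style resizer: standard resize plus learnable,
# scaled band-pass detail; one scalar gain per band is the only parameter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianResizer(nn.Module):
    def __init__(self, out_size, bands=2):
        super().__init__()
        self.out_size = out_size
        self.gains = nn.Parameter(torch.zeros(bands))  # a handful of parameters

    def _blur(self, x, k):
        # simple box blur as a stand-in for a Gaussian
        return F.avg_pool2d(x, kernel_size=2 * k + 1, stride=1, padding=k)

    def forward(self, x):
        base = F.interpolate(x, size=self.out_size, mode='bilinear',
                             align_corners=False)
        out = base
        for i, g in enumerate(self.gains):
            band = self._blur(base, i + 1) - self._blur(base, i + 2)  # band-pass
            out = out + g * band               # learnable detail boost per band
        return out
```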
arXiv Detail & Related papers (2023-04-06T04:39:21Z)
- CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed as CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
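A hedged sketch of what such a two-stage curriculum can look like; the modules, the 60% mask ratio, and the switch epoch are illustrative placeholders rather than CoMAE's actual recipe:
```python
# Curriculum sketch: cross-modal contrastive alignment first,
# masked image modeling on the same encoder afterwards.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    """Standard InfoNCE loss between paired embeddings of shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (B, B) cross-modal similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def curriculum_loss(enc, head, rgb, depth, tokens, epoch, switch_epoch=100):
    """`tokens`: patch embeddings (B, N, D); `enc`/`head` are placeholders."""
    if epoch < switch_epoch:                   # stage 1: contrastive alignment
        return info_nce(enc(rgb), enc(depth))
    mask = torch.rand(tokens.shape[:2]) < 0.6  # stage 2: mask 60% of patches
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = head(corrupted)                     # reconstruct masked embeddings
    return F.mse_loss(pred[mask], tokens[mask])
```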
arXiv Detail & Related papers (2023-02-13T07:09:45Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require large amounts of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- Stable Optimization for Large Vision Model Based Deep Image Prior in Cone-Beam CT Reconstruction [6.558735319783205]
Large Vision Model (LVM) has recently demonstrated great potential for medical imaging tasks.
Deep Image Prior (DIP) effectively guides an untrained neural network to generate high-quality CBCT images without any training data.
We propose a stable optimization method for the forward-model-free DIP model for sparse-view CBCT.
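For background, the classic Deep Image Prior loop that this line of work builds on looks roughly as follows (a generic sketch, not the proposed forward-model-free CBCT variant):
```python
# Generic Deep Image Prior: fit an untrained CNN to a degraded observation;
# early stopping of the optimization acts as the image prior.
import torch
import torch.nn.functional as F

def dip_reconstruct(net, observed, in_ch=32, steps=2000, lr=1e-3):
    """`net`: any untrained image-to-image CNN; `observed`: (1, C, H, W)."""
    z = torch.randn(1, in_ch, *observed.shape[-2:])  # fixed random input code
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(net(z), observed)          # match the observation
        loss.backward()
        opt.step()
    return net(z).detach()                           # regularized reconstruction
```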
arXiv Detail & Related papers (2022-03-23T15:16:29Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
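A hedged sketch of the corrupt-then-predict pattern described above, shown here with a replaced-patch-detection objective; the module interfaces are assumptions for illustration:
```python
# CIM-style step (illustrative): a generator fills masked patches with
# plausible content; the enhancer detects which patches were replaced.
import torch
import torch.nn.functional as F

def cim_step(generator, enhancer, patches, mask):
    """patches: (B, N, D); mask: (B, N) bool, True where the generator writes."""
    with torch.no_grad():
        fakes = generator(patches, mask)             # plausible replacements
    corrupted = torch.where(mask.unsqueeze(-1), fakes, patches)
    logits = enhancer(corrupted).squeeze(-1)         # (B, N): replaced or not?
    return F.binary_cross_entropy_with_logits(logits, mask.float())
```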
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels, based on two core designs: an asymmetric encoder-decoder architecture and a high (e.g., 75%) masking ratio.
Coupling these two designs enables us to train large models efficiently and effectively.
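The random masking at the heart of MAE can be sketched in a few lines (simplified; the official implementation also tracks indices for un-shuffling in the decoder):
```python
# MAE-style random masking: keep 25% of patch tokens, drop the rest before
# the (heavy) encoder; a light decoder later reconstructs the missing pixels.
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D) token embeddings. Returns visible tokens + kept ids."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    ids = torch.rand(B, N).argsort(dim=1)            # random patch permutation
    keep = ids[:, :n_keep]                           # indices of visible patches
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep                             # encoder sees only `visible`
```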
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Memory Efficient Meta-Learning with Large Images [62.70515410249566]
Meta-learning approaches to few-shot classification are computationally efficient at test time, requiring just a few optimization steps or a single forward pass to learn a new task, but they are memory-intensive to train.
This limitation arises because a task's entire support set, which can contain up to 1000 images, must be processed before an optimization step can be taken.
We propose LITE, a general and memory efficient episodic training scheme that enables meta-training on large tasks composed of large images on a single GPU.
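A hedged sketch of the LITE idea as summarized above: the whole support set passes forward, but gradients flow through only a small random subset (`embed` and the budget are placeholders):
```python
# LITE-style episodic step: embed all support images, back-propagate through
# only `grad_budget` of them, keeping memory roughly constant in task size.
import torch

def lite_support_features(embed, support, grad_budget=32):
    """support: (S, C, H, W) with S possibly ~1000. Assumes S > grad_budget."""
    idx = torch.randperm(support.size(0))
    live = embed(support[idx[:grad_budget]])         # gradients kept
    with torch.no_grad():
        frozen = embed(support[idx[grad_budget:]])   # forward pass only
    feats = torch.cat([live, frozen], dim=0)         # full support representation
    return feats, idx                                # features follow order `idx`
```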
arXiv Detail & Related papers (2021-07-02T14:37:13Z)