Training a Vision Transformer from scratch in less than 24 hours with 1
GPU
- URL: http://arxiv.org/abs/2211.05187v1
- Date: Wed, 9 Nov 2022 20:36:46 GMT
- Title: Training a Vision Transformer from scratch in less than 24 hours with 1
GPU
- Authors: Saghar Irandoust, Thibaut Durand, Yunduz Rakhmangulova, Wenjie Zi,
Hossein Hajimirsadeghi
- Abstract summary: We introduce some algorithmic improvements to enable training a ViT model from scratch with limited hardware (1 GPU) and time (24 hours) resources.
We develop a new image size curriculum learning strategy, which allows us to reduce the number of patches extracted from each image at the beginning of training.
Finally, we propose a new variant of the popular ImageNet1k benchmark by adding hardware and time constraints.
- Score: 10.517362955718799
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have become central to recent advances in computer vision.
However, training a vision Transformer (ViT) model from scratch can be resource
intensive and time consuming. In this paper, we aim to explore approaches to
reduce the training costs of ViT models. We introduce some algorithmic
improvements to enable training a ViT model from scratch with limited hardware
(1 GPU) and time (24 hours) resources. First, we propose an efficient approach
to add locality to the ViT architecture. Second, we develop a new image size
curriculum learning strategy, which allows us to reduce the number of patches
extracted from each image at the beginning of training. Finally, we propose
a new variant of the popular ImageNet1k benchmark by adding hardware and time
constraints. We evaluate our contributions on this benchmark, and show they can
significantly improve performance given the proposed training budget. We will
share the code at https://github.com/BorealisAI/efficient-vit-training.
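To make the curriculum idea concrete: a ViT with patch size P turns an S x S image into (S/P)^2 tokens, so attention cost falls sharply when training starts at a lower resolution and grows toward the final one. The sketch below is a minimal illustration, not the authors' released code (see the repository linked above); the start/end resolutions, the linear growth rule, and all helper names are assumptions.
```python
# A minimal sketch (NOT the authors' released code) of an image-size curriculum.
# Assumptions: patch size 16, linear growth from 128 to 224 pixels; the paper's
# actual schedule is not specified in the abstract.

import torch
import torch.nn.functional as F

PATCH = 16          # ViT patch size
START_SIZE = 128    # assumed starting resolution
FINAL_SIZE = 224    # standard ImageNet-1k resolution

def image_size_at(epoch: int, total_epochs: int) -> int:
    """Linearly grow the training resolution, snapped to a multiple of the patch size."""
    frac = min(1.0, epoch / max(1, total_epochs - 1))
    size = START_SIZE + frac * (FINAL_SIZE - START_SIZE)
    return int(round(size / PATCH)) * PATCH

def resize_batch(images: torch.Tensor, size: int) -> torch.Tensor:
    """Resize a batch (N, C, H, W) to the current curriculum resolution."""
    if images.shape[-1] == size:
        return images
    return F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)

# Token count per image is (size / PATCH) ** 2, so early epochs are much cheaper.
for epoch in (0, 25, 50, 75, 99):
    s = image_size_at(epoch, total_epochs=100)
    print(f"epoch {epoch:3d}: image {s}px, {(s // PATCH) ** 2} patches")
```
When the resolution changes mid-training, the ViT's learned positional embeddings also have to be interpolated to the new patch grid; the abstract does not say how the paper handles this, nor how locality is added to the architecture, so neither is sketched here.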
Related papers
- Local Masking Meets Progressive Freezing: Crafting Efficient Vision
Transformers for Self-Supervised Learning [0.0]
We present an innovative approach to self-supervised learning for Vision Transformers (ViTs)
This method focuses on enhancing the efficiency and speed of initial layer training in ViTs.
Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in initial layers.
arXiv Detail & Related papers (2023-12-02T11:10:09Z)
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers)
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z)
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms such as GreenMIM must be carefully designed for hierarchical ViTs, instead of reusing the vanilla and simple MAE that works for plain ViTs.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- Automated Progressive Learning for Efficient Training of Vision Transformers [125.22744987949227]
Vision Transformers (ViTs) have come with a voracious appetite for computing power, highlighting the urgent need to develop efficient training methods for ViTs.
Progressive learning, a training scheme where the model capacity grows progressively during training, has started to show its effectiveness for efficient training.
In this paper, we take a practical step towards efficient training of ViTs by customizing and automating progressive learning.
arXiv Detail & Related papers (2022-03-28T05:37:08Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- BEiT: BERT Pre-Training of Image Transformers [43.704968112586876]
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers.
Specifically, each image has two views in our pre-training, i.e., image patches and visual tokens.
We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image patches; a schematic sketch of this objective is given after this list.
arXiv Detail & Related papers (2021-06-15T16:02:37Z)
- Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach has essentially no new inventions; it is combined from MoCo v2 and BYOL.
Its performance is slightly better than that of recent works MoCo v3 and DINO, which adopt DeiT as the backbone, while using much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z)
- Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely, the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multiple heads and multiple tails.
arXiv Detail & Related papers (2020-12-01T09:42:46Z)
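As referenced in the BEiT entry above, its pre-training objective is masked image modeling: mask some patch embeddings and train the Transformer to predict the visual tokens at the masked positions. Below is a toy schematic of that objective under assumed shapes and a uniform random mask; it is not the official BEiT implementation, which additionally uses a dVAE tokenizer and blockwise masking.
```python
# Toy schematic of a BEiT-style masked image modeling objective; NOT the official
# BEiT code. Shapes, the 2-layer backbone, and the uniform random masking are
# stand-in assumptions (BEiT uses blockwise masking and a dVAE image tokenizer).

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192       # visual-token vocabulary size (the dVAE codebook size in BEiT)
NUM_PATCHES = 196  # 224x224 image with 16x16 patches
DIM = 768

class ToyMIM(nn.Module):
    def __init__(self):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # tiny stand-in ViT
        self.head = nn.Linear(DIM, VOCAB)  # predicts the visual token of each patch

    def forward(self, patch_emb, visual_tokens, mask):
        # patch_emb:     (N, P, DIM) embedded image patches
        # visual_tokens: (N, P)      target token ids from a frozen image tokenizer
        # mask:          (N, P)      bool, True where the patch is masked out
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(patch_emb), patch_emb)
        logits = self.head(self.backbone(x))
        # The loss only asks the model to recover tokens at the masked positions.
        return F.cross_entropy(logits[mask], visual_tokens[mask])

# Usage with random stand-in data.
model = ToyMIM()
emb = torch.randn(2, NUM_PATCHES, DIM)
tok = torch.randint(0, VOCAB, (2, NUM_PATCHES))
msk = torch.rand(2, NUM_PATCHES) < 0.4
print(model(emb, tok, msk).item())
```
In BEiT the targets come from a frozen discrete tokenizer rather than the random integers used in this stand-in.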
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.