Local Masking Meets Progressive Freezing: Crafting Efficient Vision
Transformers for Self-Supervised Learning
- URL: http://arxiv.org/abs/2312.02194v1
- Date: Sat, 2 Dec 2023 11:10:09 GMT
- Title: Local Masking Meets Progressive Freezing: Crafting Efficient Vision
Transformers for Self-Supervised Learning
- Authors: Utku Mert Topcuoglu, Erdem Akagündüz
- Abstract summary: We present an innovative approach to self-supervised learning for Vision Transformers (ViTs).
This method focuses on enhancing the efficiency and speed of initial layer training in ViTs.
Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in initial layers.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we present an innovative approach to self-supervised learning
for Vision Transformers (ViTs), integrating local masked image modeling with
progressive layer freezing. This method focuses on enhancing the efficiency and
speed of initial layer training in ViTs. By systematically freezing specific
layers at strategic points during training, we reduce computational demands
while maintaining or improving learning capabilities. Our approach employs a
novel multi-scale reconstruction process that fosters efficient learning in
initial layers and enhances semantic comprehension across scales. The results
demonstrate a substantial reduction in training time (~12.5%) with a minimal
impact on model accuracy (a 0.6% decrease in top-1 accuracy). Our method
achieves top-1 and top-5 accuracies of 82.6% and 96.2%, respectively,
underscoring its potential in scenarios where computational resources and time
are critical. This work marks an advancement in the field of self-supervised
learning for computer vision. The implementation of our approach is available
at our project's GitHub repository: github.com/utkutpcgl/ViTFreeze.
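To make the training loop concrete, the sketch below illustrates the general idea of progressively freezing early ViT blocks during masked-image-modeling pre-training. It is a minimal sketch, not the authors' implementation (see the linked repository for that): the timm backbone, the linear freezing schedule, and the freeze_blocks helper are assumptions made for illustration.

    # Minimal sketch of progressive layer freezing during MIM pre-training.
    # Illustrative only: the timm backbone, the linear schedule, and
    # freeze_blocks are assumptions, not the paper's code.
    import torch
    import timm

    def freeze_blocks(model, num_frozen):
        # Disable gradients for the first `num_frozen` transformer blocks;
        # keep the remaining blocks trainable.
        for i, block in enumerate(model.blocks):
            for p in block.parameters():
                p.requires_grad = i >= num_frozen

    model = timm.create_model("vit_base_patch16_224", pretrained=False)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

    total_epochs = 100
    max_frozen = 6  # freeze at most the first half of a 12-block ViT-B

    for epoch in range(total_epochs):
        # Freeze more of the early blocks as training progresses,
        # saving backward-pass compute on the frozen layers.
        freeze_blocks(model, int(max_frozen * epoch / total_epochs))
        # ... run one epoch of local masked image modeling (reconstruction loss) here ...

Once a block is frozen, its backward pass can be skipped (and its activations computed without gradient tracking), which is where most of the training-time savings in such a scheme would come from.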
Related papers
- An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical designs (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - Reduction of Class Activation Uncertainty with Background Information [0.0]
Multitask learning is a popular approach to training high-performing neural networks with improved generalization.
We propose a background class to achieve improved generalization at a lower computational cost than multitask learning.
We present a methodology for selecting background images and discuss potential future improvements.
arXiv Detail & Related papers (2023-05-05T01:40:00Z) - Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) compute expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers (a toy sketch of this reuse idea appears after this list).
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z) - EfficientTrain: Exploring Generalized Curriculum Learning for Training
Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers)
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z) - Training a Vision Transformer from scratch in less than 24 hours with 1
GPU [10.517362955718799]
We introduce some algorithmic improvements to enable training a ViT model from scratch with limited hardware (1 GPU) and time (24 hours) resources.
We develop a new image-size curriculum learning strategy, which reduces the number of patches extracted from each image at the beginning of training.
Finally, we propose a new variant of the popular ImageNet1k benchmark by adding hardware and time constraints.
arXiv Detail & Related papers (2022-11-09T20:36:46Z) - Revisiting Classifier: Transferring Vision-Language Models for Video
Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - Automated Progressive Learning for Efficient Training of Vision
Transformers [125.22744987949227]
Vision Transformers (ViTs) have come with a voracious appetite for computing power, highlighting the urgent need to develop efficient training methods for ViTs.
Progressive learning, a training scheme where the model capacity grows progressively during training, has started showing its ability in efficient training.
In this paper, we take a practical step towards efficient training of ViTs by customizing and automating progressive learning.
arXiv Detail & Related papers (2022-03-28T05:37:08Z) - MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP).
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves 76.9% top-1 accuracy with DeiT-S using only 300-epoch pre-training under linear evaluation.
arXiv Detail & Related papers (2021-06-10T11:05:18Z)
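As referenced in the Skip-Attention entry above, here is a toy sketch of the attention-reuse idea: a block that, instead of computing self-attention, refines the attention output cached from an earlier layer. Module and parameter names are hypothetical and do not follow the official SkipAt code.

    # Toy illustration of reusing attention from an earlier layer.
    # Names and the refinement layer are hypothetical, not the official SkipAt implementation.
    import torch
    import torch.nn as nn

    class AttentionReuseBlock(nn.Module):
        # Skips self-attention by refining the attention output of a previous layer.
        def __init__(self, dim, mlp_ratio=4.0):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.refine = nn.Linear(dim, dim)  # cheap substitute for a full self-attention layer
            self.norm2 = nn.LayerNorm(dim)
            hidden = int(dim * mlp_ratio)
            self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

        def forward(self, x, cached_attn):
            # Reuse the cached attention features instead of recomputing attention for this block.
            x = x + self.refine(self.norm1(cached_attn))
            x = x + self.mlp(self.norm2(x))
            return x

    tokens = torch.randn(2, 197, 768)   # (batch, tokens, dim) for ViT-B/16 at 224x224
    cached = torch.randn(2, 197, 768)   # attention output saved from an earlier block
    print(AttentionReuseBlock(768)(tokens, cached).shape)  # torch.Size([2, 197, 768])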