A Closer Look at Self-Supervised Lightweight Vision Transformers
- URL: http://arxiv.org/abs/2205.14443v2
- Date: Wed, 3 May 2023 15:07:01 GMT
- Title: A Closer Look at Self-Supervised Lightweight Vision Transformers
- Authors: Shaoru Wang, Jin Gao, Zeming Li, Xiaoqin Zhang, Weiming Hu
- Abstract summary: Self-supervised learning as a pre-training method for large-scale Vision Transformers (ViTs) has achieved promising downstream performance.
We benchmark several self-supervised pre-training methods on image classification tasks and some downstream dense prediction tasks.
Even vanilla lightweight ViTs show comparable performance to previous SOTA networks with delicate architecture design.
- Score: 44.44888945683147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning as a pre-training method for large-scale Vision Transformers (ViTs) has achieved promising downstream performance. Yet, how much these pre-training paradigms promote the performance of lightweight ViTs is considerably less studied. In this work, we develop and benchmark several self-supervised pre-training methods on image classification tasks and some downstream dense prediction tasks. We surprisingly find that, with proper pre-training, even vanilla lightweight ViTs show performance comparable to previous SOTA networks with delicately designed architectures. This challenges the recently popular notion that vanilla ViTs are not suitable for vision tasks in lightweight regimes. We also point out some defects of such pre-training, e.g., failing to benefit from large-scale pre-training data and showing inferior performance on data-insufficient downstream tasks. Furthermore, we reveal the effect of such pre-training by analyzing the properties of the layer representations and attention maps of the related models. Finally, based on these analyses, we develop a distillation strategy for the pre-training stage that further improves downstream performance for MAE-based pre-training. Code is available at
https://github.com/wangsr126/mae-lite.
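The abstract only names the distillation strategy; as a rough illustration (not the authors' implementation, which lives in the mae-lite repository), the sketch below shows one common way to combine an MAE-style masked reconstruction loss with a feature-distillation term from a frozen teacher. All class, argument, and tensor names here are hypothetical stand-ins.

```python
# Minimal sketch, assuming a student MAE encoder/decoder and a frozen
# pre-trained teacher encoder; not the authors' official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistilledMAELoss(nn.Module):
    """MAE pixel reconstruction loss plus a teacher-feature distillation term."""

    def __init__(self, student_dim: int, teacher_dim: int, alpha: float = 1.0):
        super().__init__()
        # Projection head maps student tokens into the teacher's feature space.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.alpha = alpha  # weight of the distillation term (hypothetical default)

    def forward(self, pred_pixels, target_pixels, mask,
                student_feats, teacher_feats):
        # Standard MAE objective: mean-squared error averaged over masked patches only.
        rec = ((pred_pixels - target_pixels) ** 2).mean(dim=-1)   # [B, N]
        rec = (rec * mask).sum() / mask.sum().clamp(min=1)

        # Distillation: align projected student tokens with frozen teacher tokens.
        dist = F.smooth_l1_loss(self.proj(student_feats), teacher_feats.detach())
        return rec + self.alpha * dist
```

In a setup like this the teacher is typically a larger pre-trained encoder kept frozen during the student's pre-training, and alpha trades off reconstruction against feature alignment.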
Related papers
- How Effective is Pre-training of Large Masked Autoencoders for Downstream Earth Observation Tasks? [9.515532265294187]
Self-supervised pre-training has proven highly effective for many computer vision tasks.
It remains unclear under which conditions pre-trained models offer significant advantages over training from scratch.
arXiv Detail & Related papers (2024-09-27T08:15:14Z) - An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - Towards Inadequately Pre-trained Models in Transfer Learning [37.66278189011681]
Better ImageNet pre-trained models have been demonstrated to have better transferability to downstream tasks.
In this paper, we find that during the same pre-training process, models at middle epochs, which are inadequately pre-trained, can outperform fully trained models.
Our discoveries suggest that, during pre-training, models tend to first learn spectral components corresponding to large singular values.
arXiv Detail & Related papers (2022-03-09T12:15:55Z) - On Efficient Transformer and Image Pre-training for Low-level Vision [74.22436001426517]
Pre-training has set numerous state-of-the-art results in high-level computer vision.
We present an in-depth study of image pre-training.
We find pre-training plays strikingly different roles in low-level tasks.
arXiv Detail & Related papers (2021-12-19T15:50:48Z) - Improved Fine-tuning by Leveraging Pre-training Data: Theory and
Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can reach final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z) - The Lottery Tickets Hypothesis for Supervised and Self-supervised
Pre-training in Computer Vision Models [115.49214555402567]
Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation.
Recent studies suggest that pre-training benefits from gigantic model capacity.
In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH).
arXiv Detail & Related papers (2020-12-12T21:53:55Z)