Parameter Efficient Fine-tuning of Self-supervised ViTs without Catastrophic Forgetting
- URL: http://arxiv.org/abs/2404.17245v2
- Date: Fri, 5 Jul 2024 03:28:16 GMT
- Title: Parameter Efficient Fine-tuning of Self-supervised ViTs without Catastrophic Forgetting
- Authors: Reza Akbarian Bafghi, Nidhin Harilal, Claire Monteleoni, Maziar Raissi
- Abstract summary: In vision transformers (ViTs), fine-tuning on new tasks after pre-training can significantly degrade the model's original general abilities.
Overcoming this stability-plasticity dilemma is crucial for enabling ViTs to continuously learn and adapt to new domains.
Our experiments reveal that self-supervised pre-trained ViTs fine-tuned with either Block Expansion or LoRA surpass fully fine-tuned ViTs in new domains.
- Score: 0.5249805590164901
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Artificial neural networks often suffer from catastrophic forgetting, where learning new concepts leads to a complete loss of previously acquired knowledge. We observe that this issue is particularly magnified in vision transformers (ViTs), where fine-tuning on new tasks after pre-training can significantly degrade the model's original general abilities. For instance, a DINO ViT-Base/16 pre-trained on ImageNet-1k loses over 70% accuracy on ImageNet-1k after just 10 iterations of fine-tuning on CIFAR-100. Overcoming this stability-plasticity dilemma is crucial for enabling ViTs to continuously learn and adapt to new domains while preserving their initial knowledge. In this work, we study two new parameter-efficient fine-tuning strategies: (1) Block Expansion and (2) Low-Rank Adaptation (LoRA). Our experiments reveal that self-supervised pre-trained ViTs fine-tuned with either Block Expansion or LoRA surpass fully fine-tuned ViTs in new domains while offering significantly greater parameter efficiency. Notably, we find that Block Expansion experiences only a minimal performance drop in the pre-training domain, thereby effectively mitigating catastrophic forgetting in pre-trained ViTs.
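To make the two strategies concrete, the sketch below shows a minimal LoRA wrapper around a linear layer and a zero-initialized Block Expansion of a ViT encoder in PyTorch. This is an illustration under stated assumptions, not the authors' implementation: the names `LoRALinear` and `expand_blocks`, the rank and scaling values, and the attribute names `attn.proj` / `mlp.fc2` (which follow timm's ViT blocks) are choices made here for clarity.

```python
# Minimal sketch (not the paper's code): LoRA on a linear layer and
# zero-initialized Block Expansion for a ViT-style encoder.
import copy
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update W + (alpha/r) * B A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

def expand_blocks(blocks: nn.ModuleList, every: int = 4) -> nn.ModuleList:
    """Insert a copy of every `every`-th block; only the copies are trained.
    Zeroing the copied block's output projections makes each new block an
    identity at initialization (the residual path passes inputs through)."""
    expanded = []
    for i, blk in enumerate(blocks):
        blk.requires_grad_(False)                       # freeze original blocks
        expanded.append(blk)
        if (i + 1) % every == 0:
            new_blk = copy.deepcopy(blk)
            new_blk.requires_grad_(True)
            # Assumed attribute names (attn.proj, mlp.fc2) follow timm's ViT blocks.
            nn.init.zeros_(new_blk.attn.proj.weight)
            nn.init.zeros_(new_blk.attn.proj.bias)
            nn.init.zeros_(new_blk.mlp.fc2.weight)
            nn.init.zeros_(new_blk.mlp.fc2.bias)
            expanded.append(new_blk)
    return nn.ModuleList(expanded)
```

In both cases the zero-initialized components leave the network's function unchanged at the start of fine-tuning, so training departs exactly from the pre-trained model while only a small set of parameters is updated.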
Related papers
- ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts [52.1635661239108]
We introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts.
Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming full pre-training and fine-tuning of ViTs.
arXiv Detail & Related papers (2024-06-16T15:14:56Z) - Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners [19.579098962615795]
Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model to learn new classes incrementally without forgetting when only a few samples for each class are given.
FSCIL encounters two significant challenges: catastrophic forgetting and overfitting.
We argue that large models such as vision and language transformers pre-trained on large datasets can be excellent few-shot incremental learners.
arXiv Detail & Related papers (2024-04-02T17:23:22Z) - Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT architecture for inference.
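The conversion step can be sketched in a few lines, assuming each expert is a standard two-layer FFN with identical shapes; the names `experts`, `fc1`, and `fc2` below are illustrative assumptions rather than the paper's actual module names.

```python
# Illustrative sketch: collapse an MoE layer back into a single FFN by
# averaging the experts' weights (module names and structure are assumed).
import torch
import torch.nn as nn

def average_experts_to_ffn(experts: nn.ModuleList) -> nn.Sequential:
    """Average the parameters of identically shaped two-layer expert FFNs."""
    fc1_w = torch.stack([e.fc1.weight for e in experts]).mean(dim=0)
    fc1_b = torch.stack([e.fc1.bias for e in experts]).mean(dim=0)
    fc2_w = torch.stack([e.fc2.weight for e in experts]).mean(dim=0)
    fc2_b = torch.stack([e.fc2.bias for e in experts]).mean(dim=0)

    ffn = nn.Sequential(
        nn.Linear(fc1_w.shape[1], fc1_w.shape[0]),
        nn.GELU(),
        nn.Linear(fc2_w.shape[1], fc2_w.shape[0]),
    )
    with torch.no_grad():
        ffn[0].weight.copy_(fc1_w); ffn[0].bias.copy_(fc1_b)
        ffn[2].weight.copy_(fc2_w); ffn[2].bias.copy_(fc2_b)
    return ffn
```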
arXiv Detail & Related papers (2023-08-11T12:05:12Z) - DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BeiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training [29.20567759071523]
Vision Transformers (ViTs) are developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in computer vision.
This paper reintroduces CNNs' inductive biases into ViTs while preserving their network architectures, aiming for a higher performance upper bound.
Experiments on CIFAR-10/100 and ImageNet-1k with limited training data have shown encouraging results.
arXiv Detail & Related papers (2021-12-07T07:56:50Z) - Chasing Sparsity in Vision Transformers: An End-to-End Exploration [127.10054032751714]
Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting.
This paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy.
Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget.
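As a rough illustration of training under a fixed parameter budget, the snippet below keeps only the largest-magnitude weights via a global mask; this is a generic dynamic-sparsity sketch, not the cited paper's exploration procedure, and `keep_ratio` and the masking points are arbitrary choices.

```python
# Generic sketch of keeping a fixed parameter budget via a global
# magnitude mask (not the cited paper's exact procedure).
import torch
import torch.nn as nn

@torch.no_grad()
def magnitude_masks(model: nn.Module, keep_ratio: float = 0.1) -> dict:
    """Keep only the largest-magnitude weights so that roughly
    `keep_ratio` of all weight-matrix parameters stay active."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_mags = torch.cat([p.abs().flatten() for p in weights])
    k = max(1, int(keep_ratio * all_mags.numel()))
    threshold = torch.topk(all_mags, k, largest=True).values.min()
    return {id(p): (p.abs() >= threshold).float() for p in weights}

@torch.no_grad()
def apply_masks(model: nn.Module, masks: dict) -> None:
    """Zero out pruned weights; called after each optimizer step to keep
    the trained subnetwork within the fixed budget."""
    for p in model.parameters():
        if id(p) in masks:
            p.mul_(masks[id(p)])
```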
arXiv Detail & Related papers (2021-06-08T17:18:00Z) - When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.