What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation
- URL: http://arxiv.org/abs/2307.06006v1
- Date: Wed, 12 Jul 2023 08:35:24 GMT
- Title: What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation
- Authors: Gabriele Merlin, Vedant Nanda, Ruchit Rawal, Mariya Toneva
- Abstract summary: The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task.
In this work, we examine the relationship between pretrained vision transformers and the corresponding finetuned versions on several benchmark datasets and tasks.
- Score: 7.432224771219168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pretrain-finetune paradigm usually improves downstream performance over
training a model from scratch on the same task, becoming commonplace across
many areas of machine learning. While pretraining is empirically observed to be
beneficial for a range of tasks, there is not yet a clear understanding of the
reasons for this effect. In this work, we examine the relationship between
pretrained vision transformers and the corresponding finetuned versions on
several benchmark datasets and tasks. We present new metrics that specifically
investigate the degree to which invariances learned by a pretrained model are
retained or forgotten during finetuning. Using these metrics, we present a
suite of empirical findings, including that pretraining induces transferable
invariances in shallow layers and that invariances from deeper pretrained
layers are compressed towards shallower layers during finetuning. Together,
these findings contribute to understanding some of the reasons for the
successes of pretrained models and the changes that a pretrained model
undergoes when finetuned on a downstream task.
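The paper's own invariance metrics are not detailed in this abstract, but the underlying idea, scoring how invariant each layer's representation is to input perturbations and comparing the pretrained model against its finetuned counterpart, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' method: it assumes timm ViT checkpoints, uses a color-jitter augmentation as the perturbation, and scores invariance as the cosine similarity between the CLS tokens of clean and augmented inputs at each transformer block.

    # Minimal illustrative sketch (NOT the paper's metric): score per-block
    # invariance to an augmentation for a pretrained ViT and a finetuned copy.
    # Model names, the augmentation, and the cosine-similarity score are assumptions.
    import torch
    import torch.nn.functional as F
    import timm
    from torchvision import transforms

    def block_cls_tokens(model, x):
        """Collect the CLS token emitted by each transformer block."""
        feats = []
        hooks = [blk.register_forward_hook(lambda _m, _i, out: feats.append(out[:, 0]))
                 for blk in model.blocks]
        model.forward_features(x)
        for h in hooks:
            h.remove()
        return feats

    @torch.no_grad()
    def invariance_per_block(model, images, augment):
        """Mean cosine similarity between clean and augmented CLS tokens, per block."""
        clean = block_cls_tokens(model, images)
        perturbed = block_cls_tokens(model, augment(images))
        return [F.cosine_similarity(c, p, dim=-1).mean().item()
                for c, p in zip(clean, perturbed)]

    if __name__ == "__main__":
        # Hypothetical checkpoints: in practice, load your own finetuned weights.
        pretrained = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
        finetuned = timm.create_model("vit_base_patch16_224", pretrained=True).eval()  # stand-in
        augment = transforms.ColorJitter(brightness=0.4, contrast=0.4)
        images = torch.rand(8, 3, 224, 224)  # stand-in batch; use real data in practice
        for name, model in [("pretrained", pretrained), ("finetuned", finetuned)]:
            scores = invariance_per_block(model, images, augment)
            print(name, [round(s, 3) for s in scores])

Comparing the per-block scores of the pretrained and finetuned checkpoints gives a rough, layer-by-layer picture of where invariances are retained, gained, or lost during finetuning.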
Related papers
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z)
- Inverse Dynamics Pretraining Learns Good Representations for Multitask Imitation [66.86987509942607]
We evaluate how such a paradigm should be done in imitation learning.
We consider a setting where the pretraining corpus consists of multitask demonstrations.
We argue that inverse dynamics modeling is well-suited to this setting.
arXiv Detail & Related papers (2023-05-26T14:40:46Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Probing Representation Forgetting in Supervised and Unsupervised Continual Learning [14.462797749666992]
Catastrophic forgetting is associated with an abrupt loss of knowledge previously learned by a model.
We show that representation forgetting can lead to new insights on the effect of model capacity and the loss function used in continual learning.
arXiv Detail & Related papers (2022-03-24T23:06:08Z)
- An Empirical Investigation of the Role of Pre-training in Lifelong Learning [21.995593026269578]
We show that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially.
We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima.
arXiv Detail & Related papers (2021-12-16T19:00:55Z)
- Pre-training also Transfers Non-Robustness [20.226917627173126]
Despite its recognized contribution to generalization, pre-training also transfers non-robustness from the pre-trained model into the fine-tuned model.
Results validate the effectiveness of the proposed approach in alleviating non-robustness and preserving generalization.
arXiv Detail & Related papers (2021-06-21T11:16:13Z)
- Reducing Representation Drift in Online Continual Learning [87.71558506591937]
We study the online continual learning paradigm, where agents must learn from a changing distribution with constrained memory and compute.
In this work we instead focus on the change in representations of previously observed data due to the introduction of previously unobserved class samples in the incoming data stream.
arXiv Detail & Related papers (2021-04-11T15:19:30Z)
- On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers [24.858283637038422]
We study three different pre-trained models: BERT, RoBERTa, and ALBERT.
We find that for some probing tasks fine-tuning leads to substantial changes in accuracy.
While fine-tuning indeed changes the representations of a pre-trained model, it improves probing accuracy in only a few cases.
arXiv Detail & Related papers (2020-10-06T10:54:00Z)
- Multi-Stage Influence Function [97.19210942277354]
We develop a multi-stage influence function score to track predictions from a finetuned model all the way back to the pretraining data.
We study two different scenarios, with the pretrained embeddings either fixed or updated during the finetuning tasks.
arXiv Detail & Related papers (2020-07-17T16:03:11Z)
- Investigating Transferability in Pretrained Language Models [8.83046338075119]
We consider a simple ablation technique for determining the impact of each pretrained layer on transfer task performance.
This technique reveals that in BERT, layers with high probing performance on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks.
arXiv Detail & Related papers (2020-04-30T17:23:19Z)