Fine-Tuning can Distort Pretrained Features and Underperform
Out-of-Distribution
- URL: http://arxiv.org/abs/2202.10054v1
- Date: Mon, 21 Feb 2022 09:03:34 GMT
- Title: Fine-Tuning can Distort Pretrained Features and Underperform
Out-of-Distribution
- Authors: Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang
- Abstract summary: Fine-tuning can achieve worse accuracy out-of-distribution than linear probing when the pretrained features are good and the distribution shift is large.
We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting.
Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning combines the benefits of both fine-tuning and linear probing.
- Score: 100.01469697743322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When transferring a pretrained model to a downstream task, two popular
methods are full fine-tuning (updating all the model parameters) and linear
probing (updating only the last linear layer -- the "head"). It is well known
that fine-tuning leads to better accuracy in-distribution (ID). However, in
this paper, we find that fine-tuning can achieve worse accuracy than linear
probing out-of-distribution (OOD) when the pretrained features are good and the
distribution shift is large. On 10 distribution shift datasets
(Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW,
ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on
average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We
show theoretically that this tradeoff between ID and OOD accuracy arises even
in a simple setting: fine-tuning overparameterized two-layer linear networks.
We prove that the OOD error of fine-tuning is high when we initialize with a
fixed or random head -- this is because while fine-tuning learns the head, the
lower layers of the neural network change simultaneously and distort the
pretrained features. Our analysis suggests that the easy two-step strategy of
linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning
heuristic, combines the benefits of both fine-tuning and linear probing.
Empirically, LP-FT outperforms both fine-tuning and linear probing on the above
datasets (1% better ID, 10% better OOD than full fine-tuning).
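For concreteness, here is a minimal PyTorch sketch of the LP-FT recipe; the backbone choice, toy data, learning rates, and epoch counts are placeholder assumptions, not the paper's experimental settings:

```python
import torch
import torch.nn as nn
import torchvision

NUM_CLASSES = 10  # hypothetical downstream task

# Pretrained backbone with a fresh linear head for the new task.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Toy stand-in for a real downstream dataloader.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, NUM_CLASSES, (8,)))]

def train(model, loader, params, lr, epochs):
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Step 1 (LP): freeze the backbone, train only the head
# (for simplicity we ignore BatchNorm running-stat updates here).
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
train(model, loader, model.fc.parameters(), lr=1e-2, epochs=5)

# Step 2 (FT): unfreeze everything and fine-tune end to end,
# typically with a smaller learning rate.
for p in model.parameters():
    p.requires_grad = True
train(model, loader, model.parameters(), lr=1e-4, epochs=5)
```

The intuition from the paper's analysis: because the head is already near-optimal after step 1, the subsequent full fine-tuning makes smaller updates to the lower layers and distorts the pretrained features less.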
Related papers
- Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective [32.01426831450348] (2024-05-27)
The two-stage fine-tuning (FT) method, linear probing (LP) then fine-tuning (LP-FT), outperforms linear probing and FT alone.
We analyze the training dynamics of LP-FT for classification tasks on the basis of the neural tangent kernel (NTK) theory.
Our study demonstrates the effectiveness of LP-FT for fine-tuning language models.
- AutoFT: Learning an Objective for Robust Fine-Tuning [60.641186718253735] (2024-01-18)
Foundation models encode rich representations that can be adapted to downstream tasks by fine-tuning.
Current approaches to robust fine-tuning use hand-crafted regularization techniques.
We propose AutoFT, a data-driven approach for robust fine-tuning.
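For context, one widely used hand-crafted regularizer of the kind AutoFT replaces is an L2 penalty anchoring fine-tuned weights to their pretrained values (often called L2-SP); a minimal sketch, with the penalty weight as an assumed hyperparameter:

```python
import torch

def l2_sp_penalty(model, pretrained_state, weight=1e-3):
    # Penalize each parameter's drift from its pretrained value.
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (p - pretrained_state[name]).pow(2).sum()
    return weight * penalty

# Snapshot the pretrained weights once, before fine-tuning:
# pretrained_state = {k: v.detach().clone()
#                     for k, v in model.named_parameters()}
# loss = task_loss + l2_sp_penalty(model, pretrained_state)
```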
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431] (2023-11-03)
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision-language models.
We show that both OOD classification and OOD calibration errors share an upper bound consisting of two terms computed on ID data.
Based on this insight, we design a novel framework that fine-tunes with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
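The singular-value idea can be sketched directly: compute the singular values of a batch of ID features and reward a larger minimum. This is an illustrative stand-in, not the paper's exact loss; the feature source and trade-off weight are assumptions:

```python
import torch

def min_singular_value_bonus(features, weight=0.1):
    # features: (batch, dim) matrix of ID representations.
    # svdvals is differentiable, so this term pushes the smallest
    # singular value of the feature matrix upward during training.
    sigma = torch.linalg.svdvals(features)
    return -weight * sigma.min()

# total_loss = contrastive_loss + min_singular_value_bonus(feats)
```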
- Neural Priming for Sample-Efficient Adaptation [92.14357804106787] (2023-06-16)
We propose Neural Priming, a technique for adapting large pretrained models to distribution shifts and downstream tasks.
Neural Priming can be performed at test time, even for pretraining datasets as large as LAION-2B.
- Trainable Projected Gradient Method for Robust Fine-tuning [36.470333094917436] (2023-03-19)
We propose the Trainable Projected Gradient Method (TPGM), which automatically learns the constraint imposed on each layer, giving fine-grained fine-tuning regularization.
This is motivated by formulating fine-tuning as a bi-level constrained optimization problem.
We show that TPGM outperforms existing fine-tuning methods in OOD performance while matching the best in-distribution (ID) performance.
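The projection step can be sketched as below; in the actual method the per-layer radii are themselves trained on validation data via the bi-level formulation, which is omitted here (the radii are assumed given):

```python
import torch

@torch.no_grad()
def project_to_ball(model, pretrained_state, radii):
    # After each optimizer step, project every layer's weights back
    # into an L2 ball of radius radii[name] around the pretrained
    # weights, constraining how far fine-tuning can drift per layer.
    for name, p in model.named_parameters():
        delta = p - pretrained_state[name]
        norm = delta.norm()
        if norm > radii[name]:
            p.copy_(pretrained_state[name] + delta * (radii[name] / norm))
```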
- Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [126.84770886628833] (2022-10-17)
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing).
We propose a new parameter-efficient finetuning method, termed SSF, in which one only needs to Scale and Shift the deep Features extracted by a pretrained model to match the performance of full finetuning; a sketch follows.
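The core operation is a per-channel affine transform of frozen pretrained features, with only the scale and shift trained. A minimal sketch (initialization and placement follow common practice; the paper's exact insertion points may differ):

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Per-channel scale and shift of frozen pretrained features."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # scale, init 1
        self.beta = nn.Parameter(torch.zeros(dim))   # shift, init 0

    def forward(self, x):
        # x: (..., dim) features from a frozen pretrained layer.
        return x * self.gamma + self.beta

# During tuning, freeze the backbone and optimize only these
# parameters (plus the task head), keeping the trainable count tiny.
```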
- LQF: Linear Quadratic Fine-Tuning [114.3840147070712] (2020-12-21)
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
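The linearization at the core of LQF can be sketched as a first-order Taylor expansion around the pretrained weights; torch.func supplies the Jacobian-vector product. The quadratic loss and optimization changes the paper adds are omitted here:

```python
import torch
from torch.func import functional_call, jvp

def linearized_forward(model, params0, deltas, x):
    # f_lin(x; w0 + d) = f(x; w0) + J_w f(x; w0) @ d
    def f(params):
        return functional_call(model, params, (x,))
    out, tangent = jvp(f, (params0,), (deltas,))
    return out + tangent

# params0: pretrained weights {name: tensor}, detached;
# deltas:  trainable perturbations with the same shapes, trained
#          with a quadratic (MSE) loss on one-hot labels.
```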
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.