Less is More: On the Feature Redundancy of Pretrained Models When
Transferring to Few-shot Tasks
- URL: http://arxiv.org/abs/2310.03843v1
- Date: Thu, 5 Oct 2023 19:00:49 GMT
- Title: Less is More: On the Feature Redundancy of Pretrained Models When
Transferring to Few-shot Tasks
- Authors: Xu Luo, Difan Zou, Lianli Gao, Zenglin Xu, Jingkuan Song
- Abstract summary: Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data.
We show that, for linear probing, the pretrained features can be extremely redundant when the downstream data is scarce.
- Score: 120.23328563831704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transferring a pretrained model to a downstream task can be as easy as
conducting linear probing with target data, that is, training a linear
classifier upon frozen features extracted from the pretrained model. As there
may exist significant gaps between pretraining and downstream datasets, one may
ask whether all dimensions of the pretrained features are useful for a given
downstream task. We show that, for linear probing, the pretrained features can
be extremely redundant when the downstream data is scarce, or few-shot. For
some cases such as 5-way 1-shot tasks, using only 1% of the most important
feature dimensions is able to recover the performance achieved by using the
full representation. Interestingly, most dimensions are redundant only under
few-shot settings and gradually become useful when the number of shots
increases, suggesting that feature redundancy may be the key to characterizing
the "few-shot" nature of few-shot transfer problems. We give a theoretical
understanding of this phenomenon and show how dimensions with high variance and
small distance between class centroids can serve as confounding factors that
severely disturb classification results under few-shot settings. As an attempt
at solving this problem, we find that the redundant features are difficult to
identify accurately with a small number of training samples, but we can instead
adjust feature magnitude with a soft mask based on estimated feature
importance. We show that this method can generally improve few-shot transfer
performance across various pretrained models and downstream datasets.
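Below is a minimal sketch of the two ideas described in the abstract: linear probing on frozen features with (i) hard selection of the top-1% most important dimensions and (ii) a soft mask that rescales each dimension by its estimated importance. The importance score used here (per-dimension centroid separation divided by variance, loosely following the abstract's point about high-variance, poorly separated dimensions) and the helper names (feature_importance, probe_topk, probe_softmask) are illustrative assumptions, not the paper's exact method.

```python
# Illustrative sketch only; the scoring rule and mask shape are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression


def feature_importance(support_x, support_y):
    """Per-dimension score: large centroid separation and low variance -> important."""
    classes = np.unique(support_y)
    centroids = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    sep = np.zeros(support_x.shape[1])
    n_pairs = 0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            sep += np.abs(centroids[i] - centroids[j])  # per-dimension centroid gap
            n_pairs += 1
    sep /= max(n_pairs, 1)
    var = support_x.var(axis=0) + 1e-8
    return sep / var


def linear_probe(train_x, train_y, test_x):
    """Linear probing: a linear classifier trained on frozen features."""
    clf = LogisticRegression(max_iter=1000).fit(train_x, train_y)
    return clf.predict(test_x)


def probe_topk(train_x, train_y, test_x, keep_ratio=0.01):
    """Hard selection: keep only the top keep_ratio fraction of dimensions."""
    score = feature_importance(train_x, train_y)
    k = max(1, int(keep_ratio * train_x.shape[1]))
    idx = np.argsort(score)[-k:]
    return linear_probe(train_x[:, idx], train_y, test_x[:, idx])


def probe_softmask(train_x, train_y, test_x, temperature=1.0):
    """Soft alternative: rescale every dimension by its normalized importance."""
    score = feature_importance(train_x, train_y)
    mask = (score / (score.max() + 1e-8)) ** temperature
    return linear_probe(train_x * mask, train_y, test_x * mask)


# Toy usage on random 512-d "frozen features" for a 5-way task.
rng = np.random.default_rng(0)
train_x, train_y = rng.normal(size=(25, 512)), np.repeat(np.arange(5), 5)
test_x = rng.normal(size=(50, 512))
print(probe_topk(train_x, train_y, test_x)[:10])
print(probe_softmask(train_x, train_y, test_x)[:10])
```

The soft mask avoids committing to a possibly inaccurate hard selection when only a few training samples are available, which is the motivation given in the abstract.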
Related papers
- On the Connection between Pre-training Data Diversity and Fine-tuning
Robustness [66.30369048726145]
We find that the primary factor influencing downstream effective robustness is data quantity.
We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources.
arXiv Detail & Related papers (2023-07-24T05:36:19Z)
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-training has achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
- Optimal transfer protocol by incremental layer defrosting [66.76153955485584]
Transfer learning is a powerful tool enabling model training with limited amounts of data.
The simplest transfer learning protocol is based on "freezing" the feature-extractor layers of a network pre-trained on a data-rich source task.
We show that this protocol is often sub-optimal, and that the largest performance gain may be achieved when smaller portions of the pre-trained network are kept frozen (a minimal layer-freezing sketch appears after this list).
arXiv Detail & Related papers (2023-03-02T17:32:11Z)
- On Measuring the Intrinsic Few-Shot Hardness of Datasets [49.37562545777455]
We show that few-shot hardness may be intrinsic to datasets, for a given pre-trained model.
We propose a simple and lightweight metric called "Spread" that captures the intuition behind what makes few-shot learning possible.
Our metric better accounts for few-shot hardness compared to existing notions of hardness, and is 8-100x faster to compute.
arXiv Detail & Related papers (2022-11-16T18:53:52Z)
- Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z)
- Revisiting the Updates of a Pre-trained Model for Few-shot Learning [11.871523410051527]
We compare the two popular updating methods, fine-tuning and linear probing.
We find that fine-tuning is better than linear probing as the number of samples increases.
arXiv Detail & Related papers (2022-05-13T08:47:06Z)
- GDC- Generalized Distribution Calibration for Few-Shot Learning [5.076419064097734]
Few-shot learning is an important problem in machine learning, as large labelled datasets take considerable time and effort to assemble.
Most few-shot learning algorithms have limitations; for example, some require the design of sophisticated models and loss functions, which hampers interpretability.
We propose a Generalized sampling method that learns to estimate few-shot distributions for classification as weighted random variables of all large classes.
arXiv Detail & Related papers (2022-04-11T16:22:53Z)
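As referenced in the "incremental layer defrosting" entry above, the sketch below shows one common way of partially freezing a pretrained backbone: the earliest n_frozen_stages stay fixed while later layers and a new head are fine-tuned. The backbone choice (torchvision ResNet-18), the coarse stage split, and the function name are illustrative assumptions, not the cited paper's protocol.

```python
# Illustrative partial-freezing sketch; stage split and backbone are assumptions.
import torch.nn as nn
from torchvision import models


def build_partially_frozen_resnet18(n_frozen_stages: int, num_classes: int) -> nn.Module:
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Group the backbone into coarse "stages", earliest first.
    stages = [
        nn.Sequential(model.conv1, model.bn1, model.relu, model.maxpool),
        model.layer1, model.layer2, model.layer3, model.layer4,
    ]
    for stage in stages[:n_frozen_stages]:
        for param in stage.parameters():
            param.requires_grad = False  # frozen: no gradient updates
    # Replace the classification head for the downstream task.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


# Freezing all five stages and training only model.fc reduces to linear probing;
# smaller n_frozen_stages lets more of the backbone adapt to the target task.
model = build_partially_frozen_resnet18(n_frozen_stages=3, num_classes=10)
```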