Less is More: On the Feature Redundancy of Pretrained Models When
Transferring to Few-shot Tasks
- URL: http://arxiv.org/abs/2310.03843v1
- Date: Thu, 5 Oct 2023 19:00:49 GMT
- Title: Less is More: On the Feature Redundancy of Pretrained Models When
Transferring to Few-shot Tasks
- Authors: Xu Luo, Difan Zou, Lianli Gao, Zenglin Xu, Jingkuan Song
- Abstract summary: Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data.
We show that, for linear probing, the pretrained features can be extremely redundant when the downstream data is scarce.
- Score: 120.23328563831704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transferring a pretrained model to a downstream task can be as easy as
conducting linear probing with target data, that is, training a linear
classifier upon frozen features extracted from the pretrained model. As there
may exist significant gaps between pretraining and downstream datasets, one may
ask whether all dimensions of the pretrained features are useful for a given
downstream task. We show that, for linear probing, the pretrained features can
be extremely redundant when the downstream data is scarce, or few-shot. For
some cases such as 5-way 1-shot tasks, using only 1\% of the most important
feature dimensions is able to recover the performance achieved by using the
full representation. Interestingly, most dimensions are redundant only under
few-shot settings and gradually become useful when the number of shots
increases, suggesting that feature redundancy may be the key to characterizing
the "few-shot" nature of few-shot transfer problems. We give a theoretical
understanding of this phenomenon and show how dimensions with high variance and
small distance between class centroids can serve as confounding factors that
severely disturb classification results under few-shot settings. As an attempt
at solving this problem, we find that the redundant features are difficult to
identify accurately with a small number of training samples, but we can instead
adjust feature magnitude with a soft mask based on estimated feature
importance. We show that this method can generally improve few-shot transfer
performance across various pretrained models and downstream datasets.
Related papers
- These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining [10.749875317643031]
Transfer learning is a cornerstone of modern machine learning, promising a way to adapt models pretrained on a broad mix of data to new tasks with minimal new data.<n>We evaluate model transfer from a pretraining mixture to each of its component tasks, assessing whether pretrained features can match the performance of task-specific direct training.<n>We identify a fundamental limitation in deep learning models, where networks fail to learn new features once they encode similar competing features during training.
arXiv Detail & Related papers (2025-06-23T01:04:29Z) - Elastic Representation: Mitigating Spurious Correlations for Group Robustness [24.087096334524077]
Deep learning models can suffer from severe performance degradation when relying on spurious correlations between input features and labels.<n>We propose Elastic Representation (ElRep) to learn features by imposing Nuclear- and Frobenius-norm penalties on the representation from the last layer of a neural network.
arXiv Detail & Related papers (2025-02-14T01:25:27Z) - Robust Representation Consistency Model via Contrastive Denoising [83.47584074390842]
randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations.<n> diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples.<n>We reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space.
arXiv Detail & Related papers (2025-01-22T18:52:06Z) - On the Connection between Pre-training Data Diversity and Fine-tuning
Robustness [66.30369048726145]
We find that the primary factor influencing downstream effective robustness is data quantity.
We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources.
arXiv Detail & Related papers (2023-07-24T05:36:19Z) - Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-training has achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z) - The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness
in ReLU Networks [64.12052498909105]
We study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks.
In two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples.
arXiv Detail & Related papers (2023-03-02T18:14:35Z) - Optimal transfer protocol by incremental layer defrosting [66.76153955485584]
Transfer learning is a powerful tool enabling model training with limited amounts of data.
The simplest transfer learning protocol is based on freezing" the feature-extractor layers of a network pre-trained on a data-rich source task.
We show that this protocol is often sub-optimal and the largest performance gain may be achieved when smaller portions of the pre-trained network are kept frozen.
arXiv Detail & Related papers (2023-03-02T17:32:11Z) - Overcoming Simplicity Bias in Deep Networks using a Feature Sieve [5.33024001730262]
We propose a direct, interventional method for addressing simplicity bias in deep networks.
We aim to automatically identify and suppress easily-computable spurious features in lower layers of the network.
We report substantial gains on many real-world debiasing benchmarks.
arXiv Detail & Related papers (2023-01-30T21:11:13Z) - On Measuring the Intrinsic Few-Shot Hardness of Datasets [49.37562545777455]
We show that few-shot hardness may be intrinsic to datasets, for a given pre-trained model.
We propose a simple and lightweight metric called "Spread" that captures the intuition that few-shot learning is made possible.
Our metric better accounts for few-shot hardness compared to existing notions of hardness, and is 8-100x faster to compute.
arXiv Detail & Related papers (2022-11-16T18:53:52Z) - Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in
Neural Networks [66.76034024335833]
We investigate why diverse/ complex features are learned by the backbone, and their brittleness is due to the linear classification head relying primarily on the simplest features.
We propose Feature Reconstruction Regularizer (FRR) to ensure that the learned features can be reconstructed back from the logits.
We demonstrate up to 15% gains in OOD accuracy on the recently introduced semi-synthetic datasets with extreme distribution shifts.
arXiv Detail & Related papers (2022-10-04T04:01:15Z) - Task-Customized Self-Supervised Pre-training with Scalable Dynamic
Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degenerate the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z) - Revisiting the Updates of a Pre-trained Model for Few-shot Learning [11.871523410051527]
We compare the two popular updating methods, fine-tuning and linear probing.
We find that fine-tuning is better than linear probing as the number of samples increases.
arXiv Detail & Related papers (2022-05-13T08:47:06Z) - GDC- Generalized Distribution Calibration for Few-Shot Learning [5.076419064097734]
Few shot learning is an important problem in machine learning as large labelled datasets take considerable time and effort to assemble.
Most few-shot learning algorithms suffer from one of two limitations- they either require the design of sophisticated models and loss functions, thus hampering interpretability.
We propose a Generalized sampling method that learns to estimate few-shot distributions for classification as weighted random variables of all large classes.
arXiv Detail & Related papers (2022-04-11T16:22:53Z) - High-Robustness, Low-Transferability Fingerprinting of Neural Networks [78.2527498858308]
This paper proposes Characteristic Examples for effectively fingerprinting deep neural networks.
It features high-robustness to the base model against model pruning as well as low-transferability to unassociated models.
arXiv Detail & Related papers (2021-05-14T21:48:23Z) - Why Do Better Loss Functions Lead to Less Transferable Features? [93.47297944685114]
This paper studies how the choice of training objective affects the transferability of the hidden representations of convolutional neural networks trained on ImageNet.
We show that many objectives lead to statistically significant improvements in ImageNet accuracy over vanilla softmax cross-entropy, but the resulting fixed feature extractors transfer substantially worse to downstream tasks.
arXiv Detail & Related papers (2020-10-30T17:50:31Z) - Grow-Push-Prune: aligning deep discriminants for effective structural
network compression [5.532477732693]
This paper attempts to derive task-dependent compact models from a deep discriminant analysis perspective.
We propose an iterative and proactive approach for classification tasks which alternates between a pushing step and a pruning step.
Experiments on the MNIST, CIFAR10, and ImageNet datasets demonstrate our approach's efficacy.
arXiv Detail & Related papers (2020-09-29T01:29:23Z) - Prevention is Better than Cure: Handling Basis Collapse and Transparency
in Dense Networks [0.0]
We identify a basis collapse issue as a primary cause and propose a modified loss function that circumvents this problem.
We demonstrate through carefully chosen numerical experiments that the basis collapse issue leads to the design of massively redundant networks.
Our approach results in substantially concise nets, having $100 times$ fewer parameters, while achieving a much lower $(10times)$ MSE loss at scale than reported in prior works.
arXiv Detail & Related papers (2020-08-22T17:09:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.