Newer is not always better: Rethinking transferability metrics, their
peculiarities, stability and performance
- URL: http://arxiv.org/abs/2110.06893v3
- Date: Fri, 26 May 2023 15:45:37 GMT
- Title: Newer is not always better: Rethinking transferability metrics, their
peculiarities, stability and performance
- Authors: Shibal Ibrahim, Natalia Ponomareva, Rahul Mazumder
- Abstract summary: Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular.
We show that the statistical problems with covariance estimation drive the poor performance of H-score.
We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
- Score: 5.650647159993238
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning of large pre-trained image and language models on small
customized datasets has become increasingly popular for improved prediction and
efficient use of limited resources. Fine-tuning requires identifying the best
models to transfer-learn from, and quantifying transferability prevents
expensive re-training on all of the candidate model/task pairs. In this paper,
we show that statistical problems with covariance estimation drive the poor
performance of H-score -- a common baseline for newer metrics -- and propose a
shrinkage-based estimator. This results in up to 80% absolute gain in H-score
correlation performance, making it competitive with the state-of-the-art LogME
measure. Our shrinkage-based H-score is $3\times$-$10\times$ faster to compute
than LogME. Additionally, we look into a less common setting of target (as
opposed to source) task selection. We demonstrate previously overlooked
problems in such settings with different numbers of labels, class-imbalance
ratios, etc., for some recent metrics, e.g., NCE and LEEP, which resulted in
them being misrepresented as leading measures. We propose a correction and
recommend measuring correlation performance against relative accuracy in such
settings. We support our findings with ~164,000 experiments (fine-tuning
trials) on both vision models and graph neural networks.
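As a rough illustration of the shrinkage idea in the abstract, the following is a minimal sketch of an H-score computed with a shrunk feature covariance. It uses scikit-learn's Ledoit-Wolf estimator as a stand-in for the shrinkage estimator the paper proposes; the authors' exact estimator and implementation details may differ, and the function and variable names are illustrative.

```python
# Hedged sketch: H-score with a shrinkage (Ledoit-Wolf) covariance estimate.
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_h_score(features, labels):
    """H-score of `features` (n_samples x dim) for integer `labels`,
    using a shrinkage estimate of the feature covariance."""
    features = features - features.mean(axis=0, keepdims=True)
    cov_f = LedoitWolf().fit(features).covariance_        # shrunk cov(f)
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts / counts.sum()
    class_means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = class_means - (weights[:, None] * class_means).sum(axis=0)
    # Covariance of the class-conditional means E[f | y], weighted by P(y).
    cov_cond = (weights[:, None, None] *
                np.einsum("ci,cj->cij", centered, centered)).sum(axis=0)
    # H-score: trace( cov(f)^{-1} * cov(E[f | y]) ).
    return float(np.trace(np.linalg.solve(cov_f, cov_cond)))
```

In a source-model selection setting, this score would be computed on target-task features extracted by each candidate model, and models would be ranked by it.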
Related papers
- Typicalness-Aware Learning for Failure Detection [26.23185979968123]
Deep neural networks (DNNs) often suffer from the overconfidence issue, where incorrect predictions are made with high confidence scores.
We propose a novel approach called Typicalness-Aware Learning (TAL) to address this issue and improve failure detection performance.
arXiv Detail & Related papers (2024-11-04T11:09:47Z)
- Electroencephalogram Emotion Recognition via AUC Maximization [0.0]
Imbalanced datasets pose significant challenges in areas including neuroscience, cognitive science, and medical diagnostics.
This study addresses the issue of class imbalance, using the 'Liking' label in the DEAP dataset as an example.
arXiv Detail & Related papers (2024-08-16T19:08:27Z)
- PUMA: margin-based data pruning [51.12154122266251]
We focus on data pruning, where some training samples are removed based on their distance to the model's classification boundary (i.e., their margin).
We propose PUMA, a new data pruning strategy that computes the margin using DeepFool.
We show that PUMA can be used on top of the current state-of-the-art methodology in robustness and that, unlike existing data pruning strategies, it is able to significantly improve model performance.
arXiv Detail & Related papers (2024-05-10T08:02:20Z)
- Stabilizing Subject Transfer in EEG Classification with Divergence Estimation [17.924276728038304]
We propose several graphical models to describe an EEG classification task.
We identify statistical relationships that should hold true in an idealized training scenario.
We design regularization penalties to enforce these relationships in two stages.
arXiv Detail & Related papers (2023-10-12T23:06:52Z)
- Learning Sample Difficulty from Pre-trained Models for Reliable Prediction [55.77136037458667]
We propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization.
We simultaneously improve accuracy and uncertainty calibration across challenging benchmarks.
arXiv Detail & Related papers (2023-04-20T07:29:23Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- UQ-ARMED: Uncertainty quantification of adversarially-regularized mixed effects deep learning for clustered non-iid data [0.6719751155411076]
This work demonstrates the ability to produce readily interpretable statistical metrics for model fit, fixed effects covariance coefficients, and prediction confidence.
In our experiment for AD prognosis, not only do the UQ methods provide these benefits, but several UQ methods maintain the high performance of the original ARMED method.
arXiv Detail & Related papers (2022-11-29T02:50:48Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples whose confidence exceeds the threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
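The ATC entry above describes the recipe only at a high level; the sketch below is one hedged reading of it, using the maximum softmax probability as the confidence score (the ATC paper also considers other score functions, e.g. negative entropy), with illustrative function names.

```python
# Hedged sketch of Average Thresholded Confidence (ATC) as summarized above:
# pick a confidence threshold on labeled source data so that the fraction of
# source examples above it matches source accuracy, then estimate target
# accuracy as the fraction of unlabeled target examples above that threshold.
import numpy as np

def atc_estimate(source_probs, source_labels, target_probs):
    """source_probs, target_probs: (n, num_classes) softmax outputs."""
    src_conf = source_probs.max(axis=1)
    src_accuracy = (source_probs.argmax(axis=1) == source_labels).mean()
    # Threshold t such that P(confidence > t) on source ~= source accuracy.
    threshold = np.quantile(src_conf, 1.0 - src_accuracy)
    # Predicted target accuracy: fraction of target examples above threshold.
    return float((target_probs.max(axis=1) > threshold).mean())
```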
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We evaluate prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
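The entry above can be made concrete with a short PyTorch sketch that normalizes with test-batch statistics at prediction time. This is a simplified reading of prediction-time batch normalization, not necessarily the authors' exact procedure (e.g., how batch and training statistics are mixed), and the helper name is illustrative.

```python
# Hedged sketch: use batch statistics (instead of training running statistics)
# in BatchNorm layers at prediction time, keeping the rest of the model in
# eval mode.
import torch
import torch.nn as nn

@torch.no_grad()
def predict_with_batch_stats(model: nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    model.eval()  # dropout etc. stay in inference mode
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            # Normalize with the current batch's statistics (note: train mode
            # also updates the layer's running statistics as a side effect).
            module.train()
    outputs = model(inputs)
    model.eval()  # restore standard evaluation behaviour
    return outputs
```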
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
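To illustrate the transductive prototype update described in the last entry, here is a minimal sketch of a confidence-weighted refinement step. The confidence below is a plain softmax over negative squared distances; in the paper this confidence is itself meta-learned, which is not shown, and all names are illustrative.

```python
# Hedged sketch of one transductive prototype-refinement step: prototypes are
# initialized from the support set, then updated with confidence-weighted
# query embeddings.
import torch

def refine_prototypes(support, support_labels, queries, num_classes, temperature=1.0):
    """support: (n_s, d), support_labels: (n_s,) long tensor, queries: (n_q, d)."""
    # Initial prototypes: class means of the support embeddings.
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(num_classes)])            # (C, d)
    # Query confidence: soft assignment of each query to each prototype.
    dists = torch.cdist(queries, prototypes) ** 2                      # (n_q, C)
    confidence = torch.softmax(-dists / temperature, dim=1)            # (n_q, C)
    # Confidence-weighted update: mix support means with weighted query means.
    weighted_query_sum = confidence.t() @ queries                      # (C, d)
    support_counts = torch.bincount(support_labels, minlength=num_classes).float()
    new_prototypes = (prototypes * support_counts[:, None] + weighted_query_sum) / \
                     (support_counts[:, None] + confidence.sum(dim=0)[:, None])
    return new_prototypes
```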
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.