Universality in Transfer Learning for Linear Models
- URL: http://arxiv.org/abs/2410.02164v1
- Date: Thu, 3 Oct 2024 03:09:09 GMT
- Title: Universality in Transfer Learning for Linear Models
- Authors: Reza Ghane, Danil Akhtiamov, Babak Hassibi
- Abstract summary: We study the problem of transfer learning in linear models for both regression and binary classification.
We provide an exact and rigorous analysis and relate generalization errors (in regression) and classification errors (in binary classification) for the pretrained and fine-tuned models.
- Score: 18.427215139020625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transfer learning is an attractive framework for problems where there is a paucity of data, or where data collection is costly. One common approach to transfer learning is referred to as "model-based", and involves using a model that is pretrained on samples from a source distribution, which is easier to acquire, and then fine-tuning the model on a few samples from the target distribution. The hope is that, if the source and target distributions are "close", then the fine-tuned model will perform well on the target distribution even though it has seen only a few samples from it. In this work, we study the problem of transfer learning in linear models for both regression and binary classification. In particular, we consider the use of stochastic gradient descent (SGD) on a linear model initialized with pretrained weights and using a small training data set from the target distribution. In the asymptotic regime of large models, we provide an exact and rigorous analysis and relate the generalization errors (in regression) and classification errors (in binary classification) for the pretrained and fine-tuned models. In particular, we give conditions under which the fine-tuned model outperforms the pretrained one. An important aspect of our work is that all the results are "universal", in the sense that they depend only on the first- and second-order statistics of the target distribution. They thus extend well beyond the standard Gaussian assumptions commonly made in the literature.
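As a rough illustration of the setup described in the abstract, the sketch below fine-tunes a linear regression model with plain SGD starting from pretrained weights and compares the generalization error before and after fine-tuning. The squared loss, the toy source/target construction, and all names and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sgd_finetune(w_pretrained, X_target, y_target, lr=0.01, epochs=20, seed=0):
    """Fine-tune a linear model on a few target samples with plain SGD.

    Illustrative sketch: squared loss for regression, one sample per step.
    The pretrained weights act as the initialization, as in model-based
    transfer learning.
    """
    rng = np.random.default_rng(seed)
    w = w_pretrained.copy()
    n = X_target.shape[0]
    for _ in range(epochs):
        for i in rng.permutation(n):
            x, y = X_target[i], y_target[i]
            grad = (x @ w - y) * x          # gradient of 0.5 * (x.w - y)^2
            w -= lr * grad
    return w

def generalization_error(w, Sigma, w_star, sigma_noise=0.0):
    """Test risk E[(x.w - y)^2] for y = x.w_star + noise, with cov(x) = Sigma."""
    diff = w - w_star
    return diff @ Sigma @ diff + sigma_noise ** 2

# Toy usage: source and target ground-truth weights differ only slightly
# ("close" distributions), and only a few target samples are available.
d = 50
rng = np.random.default_rng(1)
w_source = rng.normal(size=d) / np.sqrt(d)
w_target = w_source + 0.1 * rng.normal(size=d) / np.sqrt(d)
Sigma = np.eye(d)

X = rng.normal(size=(10, d))
y = X @ w_target + 0.05 * rng.normal(size=10)

w_ft = sgd_finetune(w_source, X, y)
print(generalization_error(w_source, Sigma, w_target),
      generalization_error(w_ft, Sigma, w_target))
```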
Related papers
- Ask Your Distribution Shift if Pre-Training is Right for You [74.18516460467019]
In practice, fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others.
We focus on two possible failure modes of models under distribution shift: poor extrapolation and biases in the training data.
Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases.
arXiv Detail & Related papers (2024-02-29T23:46:28Z)
- Adaptive Distribution Calibration for Few-Shot Learning with Hierarchical Optimal Transport [78.9167477093745]
We propose a novel distribution calibration method that learns an adaptive weight matrix between novel samples and base classes.
Experimental results on standard benchmarks demonstrate that our proposed plug-and-play model outperforms competing approaches.
arXiv Detail & Related papers (2022-10-09T02:32:57Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Revisiting the Updates of a Pre-trained Model for Few-shot Learning [11.871523410051527]
We compare the two popular updating methods, fine-tuning and linear probing.
We find that fine-tuning is better than linear probing as the number of samples increases.
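A minimal sketch of the two updating strategies being compared, using a two-layer linear model f(x) = h·(Bx) with squared loss; the architecture, loss, and hyperparameters are illustrative assumptions rather than the paper's actual protocol.

```python
import numpy as np

def update(B, h, X, y, lr=0.01, epochs=50, probe_only=False, seed=0):
    """Plain SGD on a two-layer linear model f(x) = h . (B x) with squared loss.

    probe_only=True  -> "linear probing": only the head h is updated, B stays frozen.
    probe_only=False -> "fine-tuning": both the backbone B and the head h are updated.
    """
    B, h = B.copy(), h.copy()
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x, t = X[i], y[i]
            z = B @ x
            err = h @ z - t                     # residual of the prediction
            h_new = h - lr * err * z            # gradient step for the head
            if not probe_only:
                B -= lr * err * np.outer(h, x)  # gradient step for the backbone
            h = h_new
    return B, h
```

With a handful of target samples, one would run both variants and track their test errors as the sample size grows, which is the comparison the summary above refers to.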
arXiv Detail & Related papers (2022-05-13T08:47:06Z)
- How to Learn when Data Gradually Reacts to Your Model [10.074466859579571]
We propose a new algorithm, Stateful Performative Gradient Descent (Stateful PerfGD), for minimizing the performative loss even in the presence of these effects.
Our experiments confirm that Stateful PerfGD substantially outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-12-13T22:05:26Z)
- Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
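The two-step pattern (build a model of the density from a few function evaluations, then sample from the model) can be illustrated with a much simpler stand-in for the paper's PSD models; the piecewise-constant density and the jittered categorical sampling below are assumptions for illustration only.

```python
import numpy as np

def sample_from_function(f, lo, hi, n_eval=64, n_samples=1000, seed=0):
    """Two-step sampling from an unnormalized, nonnegative 1-D function f:
    (1) model the density from a few evaluations of f (piecewise-constant here),
    (2) sample from that model."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(lo, hi, n_eval)
    vals = np.maximum(f(grid), 0.0)              # step 1: model from few evaluations
    probs = vals / vals.sum()
    cells = rng.choice(n_eval, size=n_samples, p=probs)
    width = (hi - lo) / (n_eval - 1)
    return grid[cells] + rng.uniform(-width / 2, width / 2, size=n_samples)  # step 2: sample

# Usage: draw samples roughly proportional to an arbitrary nonnegative function.
samples = sample_from_function(lambda x: np.exp(-x**2) * (1 + np.sin(3 * x))**2, -3, 3)
```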
arXiv Detail & Related papers (2021-10-20T12:25:22Z)
- Gradual Domain Adaptation in the Wild: When Intermediate Distributions are Absent [32.906658998929394]
We focus on the problem of domain adaptation when the goal is to shift the model towards the target distribution.
We propose GIFT, a method that creates virtual samples from intermediate distributions by interpolating representations of examples from source and target domains.
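A bare-bones sketch of the interpolation idea: given feature representations of source and target examples, virtual samples at an intermediate level t are formed by convex combination. The random pairing of source and target examples and the function names are assumptions; GIFT's actual pairing and training procedure are described in the paper.

```python
import numpy as np

def virtual_intermediate_samples(z_source, z_target, t, seed=0):
    """Create virtual samples at interpolation level t in [0, 1] by convexly
    combining source representations with randomly paired target representations."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(z_target), size=len(z_source))  # random pairing (assumption)
    return (1.0 - t) * z_source + t * z_target[idx]

# Gradual adaptation would then self-train on these virtual samples for an
# increasing schedule of t, e.g. t = 0.25, 0.5, 0.75, 1.0.
```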
arXiv Detail & Related papers (2021-06-10T22:47:06Z)
- Why do classifier accuracies show linear trends under distribution shift? [58.40438263312526]
The accuracies of models on one data distribution are approximately linear functions of their accuracies on another distribution.
We assume the probability that two models agree in their predictions is higher than what we can infer from their accuracy levels alone.
We show that a linear trend must occur when evaluating models on two distributions unless the size of the distribution shift is large.
arXiv Detail & Related papers (2020-12-31T07:24:30Z)
- Testing for Typicality with Respect to an Ensemble of Learned Distributions [5.850572971372637]
One-sample approaches to the goodness-of-fit problem offer significant computational advantages for online testing.
The ability to correctly reject anomalous data in this setting hinges on the accuracy of the model of the base distribution.
Existing methods for the one-sample goodness-of-fit problem do not account for the fact that a model of the base distribution is learned.
We propose training an ensemble of density models, considering data to be anomalous if the data is anomalous with respect to any member of the ensemble.
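A compact sketch of the ensemble idea, with Gaussians fit to bootstrap resamples standing in for whatever learned density models the paper actually uses; the model family and the single log-density threshold are assumptions.

```python
import numpy as np

def fit_ensemble(train_data, n_models=5, seed=0):
    """Fit an ensemble of simple density models: Gaussians on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_models):
        sample = train_data[rng.integers(0, len(train_data), size=len(train_data))]
        mu = sample.mean(axis=0)
        cov = np.cov(sample, rowvar=False) + 1e-6 * np.eye(sample.shape[1])
        ensemble.append((mu, cov))
    return ensemble

def log_density(x, mu, cov):
    """Multivariate Gaussian log-density."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet + d * np.log(2 * np.pi))

def is_anomalous(x, ensemble, threshold):
    """Flag x as anomalous if it has low log-density under ANY ensemble member,
    following the 'anomalous with respect to any member' rule above."""
    return any(log_density(x, mu, cov) < threshold for mu, cov in ensemble)

# Usage: fit on nominal data, then test a typical and an atypical query point.
rng = np.random.default_rng(1)
nominal = rng.normal(size=(500, 3))
ensemble = fit_ensemble(nominal)
print(is_anomalous(np.array([0.1, -0.2, 0.3]), ensemble, threshold=-10.0))  # False
print(is_anomalous(np.array([8.0, 8.0, 8.0]), ensemble, threshold=-10.0))   # True
```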
arXiv Detail & Related papers (2020-11-11T19:47:46Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)