Foundation Model's Embedded Representations May Detect Distribution
Shift
- URL: http://arxiv.org/abs/2310.13836v2
- Date: Fri, 2 Feb 2024 18:07:37 GMT
- Title: Foundation Model's Embedded Representations May Detect Distribution
Shift
- Authors: Max Vargas, Adam Tsou, Andrew Engel, Tony Chiang
- Abstract summary: We present a case study for transfer learning tasks on the Sentiment140 dataset.
We show that many pre-trained foundation models encode different representations of Sentiment140's manually curated test set $M$ from the automatically labeled training set $P$.
We argue training on $P$ and measuring performance on $M$ is a biased measure of generalization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sampling biases can cause distribution shifts between train and test datasets
for supervised learning tasks, obscuring our ability to understand the
generalization capacity of a model. This is especially important considering
the wide adoption of pre-trained foundational neural networks -- whose behavior
remains poorly understood -- for transfer learning (TL) tasks. We present a
case study for TL on the Sentiment140 dataset and show that many pre-trained
foundation models encode different representations of Sentiment140's manually
curated test set $M$ from the automatically labeled training set $P$,
confirming that a distribution shift has occurred. We argue training on $P$ and
measuring performance on $M$ is a biased measure of generalization. Experiments
on pre-trained GPT-2 show that the features learnable from $P$ do not improve
(and in fact hamper) performance on $M$. Linear probes on pre-trained GPT-2's
representations are robust and may even outperform overall fine-tuning,
implying a fundamental importance for discerning distribution shift in
train/test splits for model interpretation.
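A minimal sketch of the two measurements the abstract describes: fitting a linear probe on frozen GPT-2 embeddings of $P$ and evaluating it on $M$, and checking whether the two splits are separable in embedding space. This is not the authors' code; it assumes Hugging Face `transformers` and scikit-learn, and `P_texts`/`M_texts` below are toy placeholders standing in for the Sentiment140 splits.

```python
# Illustrative sketch only (not the authors' code). P_texts/M_texts stand in for
# Sentiment140's auto-labeled train split P and manually curated test set M.
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression

P_texts, P_labels = ["i love this :)", "worst day ever"], [1, 0]
M_texts, M_labels = ["the movie was great", "service was awful"], [1, 0]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2Model.from_pretrained("gpt2").eval()

def embed(texts, batch_size=32):
    """Mean-pool GPT-2's final hidden states to get one vector per tweet."""
    vecs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                            padding=True, truncation=True, max_length=64)
            hidden = model(**enc).last_hidden_state        # (B, T, D)
            mask = enc["attention_mask"].unsqueeze(-1)     # (B, T, 1)
            vecs.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
    return np.concatenate(vecs)

X_P, X_M = embed(P_texts), embed(M_texts)

# Linear probe trained on P, evaluated on both splits.
probe = LogisticRegression(max_iter=1000).fit(X_P, P_labels)
print("probe accuracy on P:", probe.score(X_P, P_labels))
print("probe accuracy on M:", probe.score(X_M, M_labels))

# Crude shift check: if a domain classifier separates P from M embeddings
# well above chance, the two splits are represented differently.
X = np.concatenate([X_P, X_M])
d = np.concatenate([np.zeros(len(X_P)), np.ones(len(X_M))])
print("P-vs-M separability:", LogisticRegression(max_iter=1000).fit(X, d).score(X, d))
```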
Related papers
- Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models [3.207886496235499]
We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems.
We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@$1$ and (2) via "capability gain" in which models learn to solve new problems that they previously could not solve even at high $k$.
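For reference, pass@$k$ is the standard metric estimating the chance that at least one of $k$ sampled solutions is correct; a minimal sketch of the usual unbiased estimator (not specific to this paper) is shown below.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021): per problem,
# n completions are sampled and c of them pass the verifier.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples with 5 correct gives pass@1 = 0.05 but pass@32 ~ 0.86;
# "compression" means raising pass@1 toward what pass@k already achieves.
print(pass_at_k(100, 5, 1), pass_at_k(100, 5, 32))
```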
arXiv Detail & Related papers (2025-06-16T19:03:06Z)
- Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification [7.869708570399577]
We consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$.
Theoretically, we show that the trained Transformer approaches the Bayes optimum, suggesting that it makes use of information about the training distribution.
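A generic sketch of one way to predict both quantities jointly, by minimizing a Gaussian negative log-likelihood with separate mean and log-variance heads; this illustrates the bi-objective task, not the paper's in-context Transformer setup.

```python
# Generic sketch (not the paper's in-context Transformer): jointly predict
# E[Y|X] and Var(Y|X) by minimizing a Gaussian negative log-likelihood.
import torch
import torch.nn as nn

class MeanVarNet(nn.Module):
    def __init__(self, dim_in: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.log_var_head = nn.Linear(hidden, 1)  # log-variance for numerical stability

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h).squeeze(-1), self.log_var_head(h).squeeze(-1)

def gaussian_nll(mu, log_var, y):
    # -log N(y; mu, exp(log_var)), dropping the additive constant
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

# Toy heteroscedastic data: Y = 2X + noise whose scale grows with |X|.
x = torch.randn(512, 1)
y = 2 * x.squeeze(-1) + 0.5 * x.abs().squeeze(-1) * torch.randn(512)

net = MeanVarNet(1)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    mu, log_var = net(x)
    loss = gaussian_nll(mu, log_var, y)
    opt.zero_grad(); loss.backward(); opt.step()
```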
arXiv Detail & Related papers (2024-05-24T00:08:55Z)
- Fairness Hub Technical Briefs: Definition and Detection of Distribution Shift [0.5825410941577593]
Distribution shift is a common situation in machine learning tasks, where the data used for training a model is different from the data the model is applied to in the real world.
This brief focuses on the definition and detection of distribution shifts in educational settings.
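One common detection recipe, shown below as a hedged sketch (not necessarily the procedure in the brief), is a classifier two-sample test: if a classifier can reliably tell training rows from deployment rows, the two distributions differ.

```python
# Generic classifier two-sample test: held-out AUC near 0.5 means the samples
# are indistinguishable; AUC well above 0.5 signals distribution shift.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def shift_score(X_train: np.ndarray, X_deploy: np.ndarray) -> float:
    X = np.vstack([X_train, X_deploy])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_deploy))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
print("no shift:", shift_score(rng.normal(size=(500, 4)), rng.normal(size=(500, 4))))
print("mean shift:", shift_score(rng.normal(size=(500, 4)), rng.normal(1.0, 1.0, (500, 4))))
```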
arXiv Detail & Related papers (2024-05-23T05:29:36Z)
- Ask Your Distribution Shift if Pre-Training is Right for You [74.18516460467019]
In practice, fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others.
We focus on two possible failure modes of models under distribution shift: poor extrapolation and biases in the training data.
Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases.
arXiv Detail & Related papers (2024-02-29T23:46:28Z)
- TEA: Test-time Energy Adaptation [67.4574269851666]
Test-time adaptation (TTA) aims to improve model generalizability when test data diverges from the training distribution.
We propose a novel energy-based perspective, enhancing the model's perception of target data distributions.
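For intuition, the energy score underlying this perspective is the negative log-sum-exp of a classifier's logits; the sketch below illustrates the score only, not TEA's actual adaptation procedure.

```python
# Energy score behind the energy-based view: E(x) = -T * logsumexp(f(x) / T).
import torch

def energy(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Lower energy <-> the model assigns more unnormalized density to the input."""
    return -T * torch.logsumexp(logits / T, dim=-1)

confident = torch.tensor([[6.0, 1.0, 0.5]])   # peaked logits (in-distribution-like)
flat = torch.tensor([[0.3, 0.2, 0.1]])        # flat logits (shifted / low evidence)
print(energy(confident), energy(flat))        # roughly -6.0 vs. -1.3
```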
arXiv Detail & Related papers (2023-11-24T10:49:49Z)
- Transductive conformal inference with adaptive scores [3.591224588041813]
We consider the transductive setting, where decisions are made on a test sample of $m$ new points.
We show that their joint distribution follows a Pólya urn model, and establish a concentration inequality for their empirical distribution function.
We demonstrate the usefulness of these theoretical results through uniform, in-probability guarantees for two machine learning tasks.
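For intuition about the transductive setting, a simplified split-conformal sketch producing one p-value per new point is given below; the paper's adaptive scores and Pólya-urn analysis are not reproduced.

```python
# Simplified split-conformal sketch: one p-value per test point, computed
# against a shared calibration set of nonconformity scores.
import numpy as np

def conformal_pvalues(calib_scores: np.ndarray, test_scores: np.ndarray) -> np.ndarray:
    """Small p-value => the test point's nonconformity score looks atypical."""
    n = len(calib_scores)
    counts = (calib_scores[None, :] >= test_scores[:, None]).sum(axis=1)
    return (1.0 + counts) / (n + 1.0)

rng = np.random.default_rng(0)
calib = rng.normal(size=1000)          # nonconformity scores from calibration data
test = np.array([0.0, 2.5, 4.0])       # scores for m = 3 new test points
print(conformal_pvalues(calib, test))  # roughly 0.5, then much smaller values
```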
arXiv Detail & Related papers (2023-10-27T12:48:30Z)
- Statistical Foundations of Prior-Data Fitted Networks [0.7614628596146599]
Prior-data fitted networks (PFNs) were recently proposed as a new paradigm for machine learning.
This article establishes a theoretical foundation for PFNs and illuminates the statistical mechanisms governing their behavior.
arXiv Detail & Related papers (2023-05-18T16:34:21Z)
- Diagnosing Model Performance Under Distribution Shift [9.143551270841858]
Prediction models can perform poorly when deployed to target distributions different from the training distribution.
Our approach decomposes the performance drop into terms for 1) an increase in harder but frequently seen examples from training, 2) changes in the relationship between features and outcomes, and 3) poor performance on examples infrequent or unseen during training.
arXiv Detail & Related papers (2023-03-03T15:27:16Z)
- Partial and Asymmetric Contrastive Learning for Out-of-Distribution Detection in Long-Tailed Recognition [80.07843757970923]
We show that existing OOD detection methods suffer from significant performance degradation when the training set is long-tail distributed.
We propose Partial and Asymmetric Supervised Contrastive Learning (PASCL), which explicitly encourages the model to distinguish between tail-class in-distribution samples and OOD samples.
Our method outperforms the previous state-of-the-art method by $1.29\%$, $1.45\%$, and $0.69\%$ in anomaly detection false positive rate (FPR) and by $3.24\%$, $4.06\%$, and $7.89\%$ in in-distribution
arXiv Detail & Related papers (2022-07-04T01:53:07Z)
- CARD: Classification and Regression Diffusion Models [51.0421331214229]
We introduce classification and regression diffusion (CARD) models, which combine a conditional generative model and a pre-trained conditional mean estimator.
We demonstrate the outstanding ability of CARD in conditional distribution prediction with both toy examples and real-world datasets.
arXiv Detail & Related papers (2022-06-15T03:30:38Z)
- Test-time Batch Normalization [61.292862024903584]
Deep neural networks often suffer the data distribution shift between training and testing.
We revisit the batch normalization (BN) in the training process and reveal two key insights benefiting test-time optimization.
We propose a novel test-time BN layer design, GpreBN, which is optimized during testing by minimizing an entropy loss.
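A hedged sketch of the generic ingredient mentioned here, minimizing prediction entropy at test time while updating only batch-norm parameters (in the spirit of Tent); GpreBN's specific layer design is not reproduced.

```python
# Sketch of test-time entropy minimization restricted to batch-norm parameters.
import torch
import torch.nn as nn

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    log_p = logits.log_softmax(dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

def bn_parameters(model: nn.Module):
    """Unfreeze and collect only the BN affine parameters for adaptation."""
    params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.requires_grad_(True)
            params += [m.weight, m.bias]
    return params

model = nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 10))
model.requires_grad_(False)                           # freeze everything ...
opt = torch.optim.SGD(bn_parameters(model), lr=1e-3)  # ... except BN scale/shift

test_batch = torch.randn(64, 16)   # unlabeled test-time batch
for _ in range(10):                # a few adaptation steps per batch
    loss = entropy_loss(model(test_batch))
    opt.zero_grad(); loss.backward(); opt.step()
```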
arXiv Detail & Related papers (2022-05-20T14:33:39Z)
- Agree to Disagree: Diversity through Disagreement for Better Transferability [54.308327969778155]
We propose D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data but disagreement on out-of-distribution data.
We show how D-BAT naturally emerges from the notion of generalized discrepancy.
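A hedged sketch of the general idea: both models fit the labeled training data while an extra term pushes their predictions apart on unlabeled out-of-distribution inputs; the exact D-BAT objective differs in detail and is not reproduced here.

```python
# Hedged sketch of agreement-on-train / disagreement-on-OOD training.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dbat_style_step(m1, m2, opt, x_train, y_train, x_ood, alpha=0.1):
    # Both models fit the training labels ...
    fit = F.cross_entropy(m1(x_train), y_train) + F.cross_entropy(m2(x_train), y_train)
    # ... while the overlap of their predicted distributions on OOD inputs is
    # penalized, pushing them toward different hypotheses off-distribution.
    p1, p2 = m1(x_ood).softmax(-1), m2(x_ood).softmax(-1)
    loss = fit + alpha * (p1 * p2).sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

m1 = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 3))
m2 = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 3))
opt = torch.optim.Adam(list(m1.parameters()) + list(m2.parameters()), lr=1e-3)
x_tr, y_tr, x_ood = torch.randn(64, 8), torch.randint(0, 3, (64,)), torch.randn(64, 8)
dbat_style_step(m1, m2, opt, x_tr, y_tr, x_ood)
```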
arXiv Detail & Related papers (2022-02-09T12:03:02Z)