Pretrain-Test Task Alignment Governs Generalization in In-Context Learning
- URL: http://arxiv.org/abs/2509.26551v1
- Date: Tue, 30 Sep 2025 17:19:58 GMT
- Title: Pretrain-Test Task Alignment Governs Generalization in In-Context Learning
- Authors: Mary I. Letey, Jacob A. Zavatone-Veth, Yue M. Lu, Cengiz Pehlevan
- Abstract summary: In this work, we study how the structure of pretraining tasks governs generalization in ICL.
Using a solvable model for ICL of linear regression by linear attention, we derive an exact expression for ICL generalization error in high dimensions.
This leads to an alignment measure that quantifies how much information about the pretraining task distribution is useful for inference at test time.
We show that this measure directly predicts ICL performance not only in the solvable model but also in nonlinear Transformers.
- Score: 39.98824138502169
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) is a central capability of Transformer models, but the structures in data that enable its emergence and govern its robustness remain poorly understood. In this work, we study how the structure of pretraining tasks governs generalization in ICL. Using a solvable model for ICL of linear regression by linear attention, we derive an exact expression for ICL generalization error in high dimensions under arbitrary pretraining-testing task covariance mismatch. This leads to a new alignment measure that quantifies how much information about the pretraining task distribution is useful for inference at test time. We show that this measure directly predicts ICL performance not only in the solvable model but also in nonlinear Transformers. Our analysis further reveals a tradeoff between specialization and generalization in ICL: depending on task distribution alignment, increasing pretraining task diversity can either improve or harm test performance. Together, these results identify train-test task alignment as a key determinant of generalization in ICL.
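The setting lends itself to a small numerical illustration. The sketch below is a toy under stated assumptions, not the paper's exact solvable model or alignment measure: a linear-attention-style estimator yhat(x_q) = x_q^T Gamma (1/L) sum_i y_i x_i is pretrained by gradient descent on linear-regression tasks drawn with covariance Sigma_pre, then tested on tasks with covariance Sigma_test; the printed "overlap" score is a simple normalized trace overlap chosen for illustration.

```python
# Toy sketch (assumptions: Gaussian data, diagonal task covariances, plain GD);
# not the paper's exact solvable model or derived alignment measure.
import numpy as np

rng = np.random.default_rng(0)
d, L, n_tasks, lr, steps = 8, 32, 2000, 0.05, 300

def power_law_cov(decay):
    """Diagonal task covariance with a power-law spectrum (assumed form)."""
    eig = decay ** np.arange(d)
    return np.diag(eig / eig.mean())

Sigma_pre  = power_law_cov(0.5)   # pretraining task covariance
Sigma_test = power_law_cov(0.9)   # test task covariance (mismatched)

def sample_tasks(Sigma, n):
    W  = rng.multivariate_normal(np.zeros(d), Sigma, size=n)  # task vectors
    X  = rng.standard_normal((n, L, d))                       # context inputs
    y  = np.einsum('nld,nd->nl', X, W)                        # context labels
    xq = rng.standard_normal((n, d))                          # query inputs
    yq = np.einsum('nd,nd->n', xq, W)                         # query labels
    return X, y, xq, yq

def predict(Gamma, X, y, xq):
    m = np.einsum('nl,nld->nd', y, X) / L   # in-context moment (1/L) sum y_i x_i
    return np.einsum('nd,de,ne->n', xq, Gamma, m)

# Pretrain Gamma by gradient descent on the in-context regression loss.
Gamma = np.zeros((d, d))
for _ in range(steps):
    X, y, xq, yq = sample_tasks(Sigma_pre, n_tasks)
    err = predict(Gamma, X, y, xq) - yq
    m = np.einsum('nl,nld->nd', y, X) / L
    Gamma -= lr * np.einsum('n,nd,ne->de', err, xq, m) / n_tasks

# Evaluate under matched vs. mismatched task covariance.
for name, Sigma in [('matched', Sigma_pre), ('mismatched', Sigma_test)]:
    X, y, xq, yq = sample_tasks(Sigma, 5000)
    mse = np.mean((predict(Gamma, X, y, xq) - yq) ** 2)
    overlap = np.trace(Sigma_pre @ Sigma) / (
        np.linalg.norm(Sigma_pre) * np.linalg.norm(Sigma))
    print(f'{name}: test MSE {mse:.3f}, covariance overlap {overlap:.3f}')
```

Sweeping the mismatch between Sigma_pre and Sigma_test makes the overlap score track the test error, qualitatively mirroring the paper's alignment result.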
Related papers
- A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning [52.07397258423034]
We propose a new framework to analyze ICL performance in a class of realistic settings.
We derive the precise relationship between ICL performance, context length, and the KL divergence between the pre-training and query task distributions (see the sketch below).
arXiv Detail & Related papers (2025-10-26T09:21:29Z)
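For zero-mean Gaussian task priors (an assumed form for illustration; the paper's setting may differ), that KL divergence has a closed form:

```python
# Hedged sketch: KL divergence between two zero-mean Gaussian task priors,
# KL(N(0,S_q) || N(0,S_p)) = 0.5*(tr(S_p^{-1} S_q) - d + log det S_p - log det S_q).
import numpy as np

def gaussian_kl(S_q, S_p):
    d = S_q.shape[0]
    inv_p = np.linalg.inv(S_p)
    _, logdet_q = np.linalg.slogdet(S_q)
    _, logdet_p = np.linalg.slogdet(S_p)
    return 0.5 * (np.trace(inv_p @ S_q) - d + logdet_p - logdet_q)

d = 4
S_pre   = np.eye(d)                       # pre-training task covariance
S_query = np.diag([2.0, 1.0, 0.5, 0.25])  # query task covariance
print(gaussian_kl(S_query, S_pre))        # larger KL ~ worse expected transfer
```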
- In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning [51.56484100374058]
We introduce a principled risk decomposition that separates the total ICL risk into two components: Bayes Gap and Posterior Variance.
For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts.
The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty (see the decomposition sketch below).
arXiv Detail & Related papers (2025-10-13T03:42:31Z)
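The decomposition can be checked numerically in an assumed Gaussian linear-regression toy (not the paper's uniform-attention setting): because the Bayes predictor is the posterior mean, the cross term vanishes and the risk of any context-based predictor splits exactly into a Bayes gap plus the model-independent posterior variance.

```python
# Monte Carlo check of the exact split
#   E[(f - y*)^2] = E[(f - f_Bayes)^2] + E[(f_Bayes - y*)^2]
# in an assumed Gaussian linear-regression setting (prior N(0, I), noise sigma).
import numpy as np

rng = np.random.default_rng(1)
d, L, noise, trials = 5, 20, 0.1, 4000
gap, post_var = 0.0, 0.0
for _ in range(trials):
    w = rng.standard_normal(d)                 # task drawn from the prior
    X = rng.standard_normal((L, d))
    y = X @ w + noise * rng.standard_normal(L)
    xq = rng.standard_normal(d)
    # Bayes posterior mean under the true prior and noise level.
    A = X.T @ X / noise**2 + np.eye(d)
    mu = np.linalg.solve(A, X.T @ y / noise**2)
    f_bayes = xq @ mu
    f_model = xq @ np.linalg.lstsq(X, y, rcond=None)[0]  # a non-Bayes "model"
    yq = xq @ w + noise * rng.standard_normal()
    gap += (f_model - f_bayes) ** 2
    post_var += (f_bayes - yq) ** 2
print(f'Bayes gap ~ {gap/trials:.4f}, posterior variance ~ {post_var/trials:.4f}')
```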
- Learning Linear Regression with Low-Rank Tasks in-Context [8.347662730632047]
In-context learning (ICL) is a key building block of modern large language models.
We analyze a linear attention model trained on low-rank regression tasks.
We find that statistical fluctuations in finite pre-training data induce an implicit regularization.
arXiv Detail & Related papers (2025-10-06T07:27:49Z)
- How Does the Pretraining Distribution Shape In-Context Learning? Task Selection, Generalization, and Robustness [6.723482324209954]
We show how statistical properties of the pretraining distribution shape ICL on numerical tasks.
We develop a theoretical framework that unifies task selection and generalization, extending and sharpening earlier results.
We then empirically study how ICL performance varies with the pretraining distribution on challenging tasks.
arXiv Detail & Related papers (2025-10-01T17:52:29Z)
- Surprise Calibration for Better In-Context Learning [6.566285172635043]
In-context learning (ICL) has emerged as a powerful paradigm for task adaptation in large language models.
Existing bias calibration methods apply fixed class priors across all inputs, limiting their efficacy in dynamic ICL settings.
We introduce a novel method, Surprise Calibration (SC), which captures the temporal dynamics of class priors (see the generic sketch below).
arXiv Detail & Related papers (2025-06-15T10:04:42Z)
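The summary does not specify SC's update rule, so the following is only a generic sketch of the pattern it targets: replace a fixed class prior with a temporally updated estimate and calibrate each prediction against it (the EMA rule and all names below are assumptions).

```python
# Generic sketch only; not the paper's Surprise Calibration procedure.
import numpy as np

def calibrated_stream(logit_stream, n_classes, ema=0.9):
    log_prior = np.full(n_classes, -np.log(n_classes))   # start uniform
    for logits in logit_stream:
        calibrated = logits - log_prior                  # discount current prior
        probs = np.exp(calibrated - calibrated.max())
        probs /= probs.sum()
        yield probs.argmax()
        # Temporal update: move the prior toward the observed prediction mass.
        p_raw = np.exp(logits - logits.max()); p_raw /= p_raw.sum()
        prior = ema * np.exp(log_prior) + (1 - ema) * p_raw
        log_prior = np.log(prior / prior.sum())

# Usage with random stand-in logits:
stream = (np.random.default_rng(0).standard_normal(3) for _ in range(5))
print(list(calibrated_stream(stream, n_classes=3)))
```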
- Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context [24.905102026459428]
Transformers have demonstrated remarkable in-context learning capabilities, adapting to new tasks by simply conditioning on demonstrations without parameter updates.
It remains unclear to what extent transformers learn in-context optimally compared to principled learning algorithms.
arXiv Detail & Related papers (2025-02-07T00:26:45Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of zero-shot generalization of VLMs; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario (a generic sketch of orthogonal fine-tuning follows below).
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
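The summary does not give OrthSR's parameterization; a common way to realize orthogonal fine-tuning, shown below as an assumed sketch, multiplies a frozen pretrained weight by a Cayley-parameterized orthogonal matrix, which preserves the norms and pairwise angles of the pretrained features.

```python
# Assumed sketch of orthogonal fine-tuning (not OrthSR's exact construction):
# fine-tune a frozen pretrained weight W as W' = Q W with Q orthogonal,
# where Q is built from an unconstrained parameter via the Cayley transform.
import numpy as np

def cayley_orthogonal(A):
    """Map an unconstrained square matrix to an orthogonal one."""
    S = 0.5 * (A - A.T)                       # skew-symmetric part
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + S, I - S)      # (I+S)^{-1}(I-S), orthogonal

d_out, d_in = 6, 4
W_pre = np.random.default_rng(2).standard_normal((d_out, d_in))  # frozen weight
A = np.zeros((d_out, d_out))                  # trainable parameter (init: Q = I)
Q = cayley_orthogonal(A)
W_ft = Q @ W_pre                              # fine-tuned weight
assert np.allclose(Q.T @ Q, np.eye(d_out))    # orthogonality check
assert np.allclose(np.linalg.norm(W_ft), np.linalg.norm(W_pre))  # norm preserved
```

Because Q is orthogonal by construction, gradient steps on A cannot distort the pretrained weight spectrum, which is the usual rationale for this family of methods.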
- How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze how the mechanics by which Transformers achieve ICL contribute to the technical challenges of analyzing their training.
arXiv Detail & Related papers (2024-02-23T21:07:20Z)
- Test-Time Training for Semantic Segmentation with Output Contrastive Loss [12.535720010867538]
Deep learning-based segmentation models have achieved impressive performance on public benchmarks, but generalizing well to unseen environments remains a major challenge.
This paper introduces Output Contrastive Loss (OCL), known for its capability to learn robust and generalized representations, to stabilize the adaptation process (a hedged sketch follows below).
Our method excels even when applied to models initially pre-trained using domain adaptation methods on test domain data, showcasing its resilience and adaptability.
arXiv Detail & Related papers (2023-11-14T03:13:47Z)
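The paper's exact OCL formulation is not reproduced in the summary; below is an assumed InfoNCE-style sketch of an output-level contrastive objective for test-time adaptation, using pseudo-labels taken from the model's current prediction (all names are assumptions).

```python
# Assumed sketch, not the paper's OCL: pull same-pseudo-class output embeddings
# together and push different-class ones apart, InfoNCE-style.
import numpy as np

def output_contrastive_loss(emb, pseudo, temp=0.1):
    """emb: (N, D) output embeddings; pseudo: (N,) integer pseudo-labels."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / temp                               # pairwise similarity
    n = len(emb)
    same = (pseudo[:, None] == pseudo[None, :]) & ~np.eye(n, dtype=bool)
    # Partition function per anchor, excluding self-similarity.
    logZ = np.log(np.exp(sim).sum(axis=1) - np.exp(np.diag(sim)))
    # Average negative log-probability over all same-class pairs.
    return -(sim[same] - np.repeat(logZ, same.sum(axis=1))).mean()

# Usage on random stand-in data:
rng = np.random.default_rng(0)
emb = rng.standard_normal((64, 16))        # e.g. per-pixel output embeddings
pseudo = rng.integers(0, 4, size=64)       # pseudo-labels from argmax prediction
print(output_contrastive_loss(emb, pseudo))
```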
- In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax [36.98247762224868]
In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks.
Do models infer the underlying structure of the task defined by the context, or do they rely on superficial generalizations that only generalize to identically distributed examples?
In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs.
The variance is explained more by the composition of the pre-training corpus and supervision methods than by model size.
arXiv Detail & Related papers (2023-11-13T23:52:43Z)
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
- What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm (see the toy sketch after this entry).
We prove that the error of the pretrained model is bounded by a sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z)
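As a toy illustration of the Bayesian-model-averaging view (an assumed discrete-task setting, not the paper's construction), the in-context prediction below is a posterior-weighted average of per-task predictors, with the posterior computed from the context examples.

```python
# Toy Bayesian model averaging over a discrete set of linear tasks (assumed
# setting for illustration): prediction = sum_k p(task_k | context) * w_k^T x_q.
import numpy as np

rng = np.random.default_rng(3)
d, L, noise, K = 3, 10, 0.2, 5
tasks = rng.standard_normal((K, d))          # candidate linear tasks w_k
w_true = tasks[2]                            # context generated by task 2
X = rng.standard_normal((L, d))
y = X @ w_true + noise * rng.standard_normal(L)

# Posterior over tasks given the context (uniform prior, Gaussian likelihood).
loglik = -0.5 * ((y[None, :] - tasks @ X.T) ** 2).sum(axis=1) / noise**2
post = np.exp(loglik - loglik.max()); post /= post.sum()

xq = rng.standard_normal(d)
y_bma = post @ (tasks @ xq)                  # model-averaged prediction
print(post.round(3), y_bma, xq @ w_true)     # posterior concentrates on task 2
```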