How Does the Pretraining Distribution Shape In-Context Learning? Task Selection, Generalization, and Robustness
- URL: http://arxiv.org/abs/2510.01163v1
- Date: Wed, 01 Oct 2025 17:52:29 GMT
- Title: How Does the Pretraining Distribution Shape In-Context Learning? Task Selection, Generalization, and Robustness
- Authors: Waïss Azizian, Ali Hasan,
- Abstract summary: We show how statistical properties of the pretraining distribution shape ICL on numerical tasks. We develop a theoretical framework that unifies task selection and generalization, extending and sharpening earlier results. We then empirically study how ICL performance varies with the pretraining distribution on challenging tasks.
- Score: 6.723482324209954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of in-context learning (ICL) in large language models (LLMs) remains poorly understood despite its consistent effectiveness, enabling models to adapt to new tasks from only a handful of examples. To clarify and improve these capabilities, we characterize how the statistical properties of the pretraining distribution (e.g., tail behavior, coverage) shape ICL on numerical tasks. We develop a theoretical framework that unifies task selection and generalization, extending and sharpening earlier results, and show how distributional properties govern sample efficiency, task retrieval, and robustness. To this end, we generalize Bayesian posterior consistency and concentration results to heavy-tailed priors and dependent sequences, better reflecting the structure of LLM pretraining data. We then empirically study how ICL performance varies with the pretraining distribution on challenging tasks such as stochastic differential equations and stochastic processes with memory. Together, these findings suggest that controlling key statistical properties of the pretraining distribution is essential for building ICL-capable and reliable LLMs.
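To make the abstract's central idea concrete, here is a minimal sketch of what "controlling the tail behavior of the pretraining task distribution" could look like for one of the numerical tasks it mentions (a stochastic differential equation). The Ornstein-Uhlenbeck process, the Gaussian vs. Student-t priors, and all parameter values below are illustrative assumptions, not the authors' actual data-generation protocol.

```python
# Illustrative sketch (not the paper's pipeline): in-context sequences for a
# numerical task -- an Ornstein-Uhlenbeck SDE, dX_t = -theta * X_t dt + sigma dW_t --
# where the task parameter theta is drawn from a light-tailed (Gaussian) or a
# heavy-tailed (Student-t) prior, so the tail behavior of the pretraining task
# distribution can be varied directly.
import numpy as np

rng = np.random.default_rng(0)

def sample_theta(prior: str) -> float:
    """Draw a mean-reversion rate from a light- or heavy-tailed task prior."""
    if prior == "gaussian":        # light-tailed prior over tasks
        return abs(rng.normal(loc=1.0, scale=0.5))
    if prior == "student_t":       # heavy-tailed prior (df=2 has infinite variance)
        return abs(1.0 + 0.5 * rng.standard_t(df=2))
    raise ValueError(prior)

def ou_trajectory(theta: float, sigma: float = 0.3,
                  n_steps: int = 64, dt: float = 0.05) -> np.ndarray:
    """Euler-Maruyama simulation of an OU process; one in-context sequence."""
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        x[t] = x[t - 1] - theta * x[t - 1] * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

def make_pretraining_batch(prior: str, batch_size: int = 32) -> np.ndarray:
    """Each row is a sequence generated by a freshly sampled task parameter."""
    return np.stack([ou_trajectory(sample_theta(prior)) for _ in range(batch_size)])

light = make_pretraining_batch("gaussian")
heavy = make_pretraining_batch("student_t")
print(light.shape, heavy.shape)   # (32, 64) (32, 64)
```

A model pretrained on such batches could then be probed with held-out tasks whose parameters fall inside or outside the prior's bulk, which is one way to study the sample-efficiency and robustness questions the abstract raises.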
Related papers
- A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning [52.07397258423034]
We propose a new framework to analyze ICL performance in a class of realistic settings. We derive the precise relationship between ICL performance, context length, and the KL divergence between the pre-training and query task distributions.
arXiv Detail & Related papers (2025-10-26T09:21:29Z)
- Pretrain-Test Task Alignment Governs Generalization in In-Context Learning [39.98824138502169]
In this work, we study how the structure of pretraining tasks governs generalization in ICL. Using a solvable model for ICL of linear regression by linear attention, we derive an exact expression for ICL generalization error in high dimensions. We show that this measure directly predicts ICL performance not only in the solvable model but also in nonlinear Transformers.
arXiv Detail & Related papers (2025-09-30T17:19:58Z)
- A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search [15.387256204743407]
Large language models (LLMs) demand considerable computational, energy, and financial resources during both training and deployment. Inference costs now represent a significant and growing component of the overall resource burden. We introduce directed stochastic skill search (DS3), a general framework that represents inference as stochastic traversal over a learned skill graph.
arXiv Detail & Related papers (2025-06-10T14:47:48Z)
- Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? [32.04523360747506]
We construct a dataset using 50 1B-parameter LLM variants with systematically varied pre-training configurations. We introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%.
arXiv Detail & Related papers (2025-04-16T21:19:09Z)
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies that deprioritize redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- Aligning Instruction Tuning with Pre-training [61.50161961371844]
We propose Aligning Instruction Tuning with Pre-training (AITP) to align instruction tuning with pre-training distributions. We show consistent performance improvements with AITP on three fully open large language models (LLMs) across eight benchmarks.
arXiv Detail & Related papers (2025-01-16T08:27:40Z)
- Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling [37.36879079951306]
Large Language Models (LLMs) exhibit In-Context Learning (ICL). ICL offers fast adaptation across natural language tasks and domains, but its emergence is less straightforward for modalities beyond text. We identify exact token repetitions in the training data sequences as an important factor for ICL. We unlock ICL capabilities for various visual datasets and a more challenging EEG classification task in a few-shot learning regime.
arXiv Detail & Related papers (2025-01-09T09:45:05Z)
- Investigating the Pre-Training Dynamics of In-Context Learning: Task Recognition vs. Task Learning [99.05401042153214]
In-context learning (ICL) is potentially attributed to two major abilities: task recognition (TR) and task learning (TL).
We take the first step by examining the pre-training dynamics of the emergence of ICL.
We propose a simple yet effective method to better integrate these two abilities for ICL at inference time.
arXiv Detail & Related papers (2024-06-20T06:37:47Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax [36.98247762224868]
In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks.
Do models infer the underlying structure of the task defined by the context, or do they rely on superficial heuristics that only generalize to identically distributed examples?
In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs.
The variance is explained more by the composition of the pre-training corpus and supervision methods than by model size.
arXiv Detail & Related papers (2023-11-13T23:52:43Z)
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression (see the sketch after this list).
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
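The last entry above names a concrete solvable setup: a linearly parameterized single-layer linear attention model pretrained on linear-regression tasks. Below is a small illustrative sketch of that kind of setup, where the model's in-context prediction reduces to x_q^T W ((1/n) Σ_i y_i x_i) for a single learned matrix W. The dimensions, learning rate, and plain SGD loop are assumptions made for illustration, not the referenced paper's exact construction.

```python
# Hedged sketch of single-layer linear attention for in-context linear regression:
# each pretraining example is a fresh task w ~ N(0, I_d) with n_ctx context pairs
# (x_i, y_i = w^T x_i) and one query point; the model predicts x_q^T W h with
# h = (1/n) sum_i y_i x_i and learns only the d x d matrix W.
import numpy as np

rng = np.random.default_rng(1)
d, n_ctx, batch = 8, 16, 64          # feature dim, context length, tasks per step

def sample_tasks():
    """Fresh linear-regression tasks: w ~ N(0, I_d), context pairs and a query."""
    w = rng.normal(size=(batch, d))
    X = rng.normal(size=(batch, n_ctx, d))
    y = np.einsum("bnd,bd->bn", X, w)
    x_q = rng.normal(size=(batch, d))            # held-out query input
    y_q = np.einsum("bd,bd->b", x_q, w)          # its label
    return X, y, x_q, y_q

W = np.zeros((d, d))                             # the single learnable matrix
lr = 0.05
for step in range(2000):                         # plain SGD over fresh tasks
    X, y, x_q, y_q = sample_tasks()
    h = np.einsum("bn,bnd->bd", y, X) / n_ctx    # context statistic (1/n) sum_i y_i x_i
    pred = np.einsum("bd,de,be->b", x_q, W, h)   # linear-attention prediction x_q^T W h
    err = pred - y_q
    grad = np.einsum("b,bd,be->de", err, x_q, h) / batch
    W -= lr * grad

# With isotropic Gaussian covariates, the learned W ends up close to a scaled
# identity, so the model implements (roughly) one step of least squares in context.
print(np.round(W, 2))
```

Questions like the one in the last entry, how task diversity during pretraining affects this learned W and its behavior on unseen tasks, can be probed in this toy setting by restricting the number of distinct w vectors seen during training.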
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.