Related papers: Next-token pretraining implies in-context learning

Next-token pretraining implies in-context learning

URL: http://arxiv.org/abs/2505.18373v2
Date: Sun, 13 Jul 2025 01:17:02 GMT
Title: Next-token pretraining implies in-context learning
Authors: Paul M. Riechers, Henry R. Bigelow, Eric A. Alt, Adam Shai,
Abstract summary: We show how models adapt to context when trained on token sequences, especially from non-ergodic sources.<n>Our information-theoretic framework precisely predicts these in-distribution ICL dynamics.<n>We also show that a model's in-context performance on any task is mathematically coupled to the ensemble of tasks seen in pretraining.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We argue that in-context learning (ICL) predictably arises from standard self-supervised next-token pretraining, rather than being an exotic emergent property. This work establishes the foundational principles of this emergence by focusing on in-distribution ICL, demonstrating how models necessarily adapt to context when trained on token sequences, especially from non-ergodic sources. Our information-theoretic framework precisely predicts these in-distribution ICL dynamics (i.e., context-dependent loss reduction). We verify this with experiments using synthetic datasets of differing types of correlational structure, reproducing characteristic phenomena like phase transitions in training loss for induction head formation and power-law scaling of in-context loss. We further show that a model's in-context performance on any task is mathematically coupled to the ensemble of tasks seen in pretraining, offering a fundamental explanation, grounded in architecture- and modality-independent principles, for such inference-time learning.

Related papers

Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models [77.98801218316505]
Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning.<n>We investigate the internal processing of LLMs during in-context concept inference.
arXiv Detail & Related papers (2026-02-08T03:14:39Z)
A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning [52.07397258423034]
We propose a new framework to analyze the ICL performance in a class of realistic settings.<n>We derive the precise relationship between ICL performance, context length and the KL divergence between pre-train and query task distribution.
arXiv Detail & Related papers (2025-10-26T09:21:29Z)
Toward Understanding In-context vs. In-weight Learning [50.24035812301655]
We identify simplified distributional properties that give rise to the emergence and disappearance of in-context learning.<n>We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
arXiv Detail & Related papers (2024-10-30T14:09:00Z)
Sequential Representation Learning via Static-Dynamic Conditional Disentanglement [58.19137637859017]
This paper explores self-supervised disentangled representation learning within sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that breaks the usual independence assumption between those factors by explicitly accounting for the causal relationship between the static/dynamic variables. Experiments show that the proposed approach outperforms previous complex state-of-the-art techniques in scenarios where the dynamics of a scene are influenced by its content.
arXiv Detail & Related papers (2024-08-10T17:04:39Z)
Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers [49.80959223722325]
We study the distinction between feed-forward and attention layers in large language models.<n>We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning.
arXiv Detail & Related papers (2024-06-05T08:51:08Z)
On Understanding Attention-Based In-Context Learning for Categorical Data [49.40350941996942]
We develop a network composed of attention blocks, with each block employing a self-attention layer followed by a cross-attention layer, with associated skip connections.<n>This model can exactly perform multi-step functional GD inference for in-context inference with categorical observations.
arXiv Detail & Related papers (2024-05-27T15:03:21Z)
The mechanistic basis of data dependence and abrupt learning in an in-context classification task [0.3626013617212666]
We show that specific distributional properties inherent in language control the trade-off or simultaneous appearance of two forms of learning. In-context learning is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning. We propose that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL.
arXiv Detail & Related papers (2023-12-03T20:53:41Z)
A Theory of Emergent In-Context Learning as Implicit Structure Induction [8.17811111226145]
Scaling large language models leads to an emergent capacity to learn in-context from example demonstrations. We argue that in-context learning relies on recombination of compositional operations found in natural language data. We show how in-context learning is supported by a representation of the input's compositional structure.
arXiv Detail & Related papers (2023-03-14T15:24:05Z)
EigenNoise: A Contrastive Prior to Warm-Start Representations [0.0]
We present a naive scheme for word vectors based on a dense, independent co-occurrence model. We show that our model, EigenNoise, can approach the performance of empirically trained GloVe despite the lack of pre-training data.
arXiv Detail & Related papers (2022-05-09T15:30:50Z)
An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution on the emergence of in-context learning. We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept. We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z)
Developing Constrained Neural Units Over Time [81.19349325749037]
This paper focuses on an alternative way of defining Neural Networks, that is different from the majority of existing approaches. The structure of the neural architecture is defined by means of a special class of constraints that are extended also to the interaction with data. The proposed theory is cast into the time domain, in which data are presented to the network in an ordered manner.
arXiv Detail & Related papers (2020-09-01T09:07:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.