Differential learning kinetics govern the transition from memorization to generalization during in-context learning
- URL: http://arxiv.org/abs/2412.00104v2
- Date: Thu, 12 Dec 2024 16:10:51 GMT
- Title: Differential learning kinetics govern the transition from memorization to generalization during in-context learning
- Authors: Alex Nguyen, Gautam Reddy
- Abstract summary: Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks. We show that the sub-circuits that memorize and generalize can be viewed as largely independent.
- Score: 0.5555497750998242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks and the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network's limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relative rates at which these sub-circuits learn explains the transition from memorization to generalization, rather than capacity constraints. We uncover a memorization scaling law, which determines the task diversity threshold at which the network generalizes. The theory quantitatively explains a variety of other ICL-related phenomena, including the long-tailed distribution of when ICL is acquired, the bimodal behavior of solutions close to the task diversity threshold, the influence of contextual and data distributional statistics on ICL, and the transient nature of ICL.
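To make the experimental setting concrete, below is a minimal sketch (Python/NumPy) of the kind of synthetic in-context classification task the abstract refers to, where "task diversity" is the number K of distinct item-to-label mappings used during pretraining. The dimensions, vocabulary sizes, and sequence construction are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch (assumed, not the authors' exact construction): a synthetic
# in-context classification task. "Task diversity" is the number K of distinct
# item -> label mappings used to build training sequences. A network can answer
# the query either by memorizing the K mappings in its weights (IWL) or by
# reading the item-label pairs off the context (ICL).
import numpy as np

rng = np.random.default_rng(0)
D = 32          # item embedding dimension (illustrative)
L = 8           # exemplar (item, label) pairs per context (illustrative)
N_ITEMS = 128   # item vocabulary size (illustrative)
N_LABELS = 16   # number of label classes (illustrative)

items = rng.normal(size=(N_ITEMS, D))   # fixed random item embeddings
labels = np.eye(N_LABELS)               # one-hot label embeddings

def sample_tasks(K):
    """A task is one random item -> label assignment; K sets task diversity."""
    return [rng.integers(0, N_LABELS, size=N_ITEMS) for _ in range(K)]

def sample_sequence(task):
    """One sequence: L exemplar pairs, then a query item whose label is the
    target. The query item also appears among the exemplars, so the answer is
    recoverable from the context alone."""
    idx = rng.choice(N_ITEMS, size=L, replace=False)
    q = rng.integers(0, L)
    pairs = np.concatenate([items[idx], labels[task[idx]]], axis=1)  # (L, D+N_LABELS)
    query = np.concatenate([items[idx[q]], np.zeros(N_LABELS)])      # label masked out
    return np.vstack([pairs, query]), task[idx[q]]

train_tasks = sample_tasks(K=64)        # sweep K to locate the diversity threshold
seq, target = sample_sequence(train_tasks[0])
print(seq.shape, target)                # (L + 1, D + N_LABELS) and an integer label
```

Training a small transformer on sequences drawn from these K tasks and evaluating it on sequences built from held-out mappings separates in-weights from in-context accuracy; the paper's claim is that the threshold value of K is set by how quickly the memorizing sub-circuit learns relative to the generalizing one, not by capacity limits.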
Related papers
- Provable In-Context Learning of Nonlinear Regression with Transformers [58.018629320233174]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z)
- When can in-context learning generalize out of task distribution? [10.962094053749095]
In-context learning (ICL) is a capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize out of distribution. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space.
arXiv Detail & Related papers (2025-06-05T20:30:50Z)
- Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence [28.260455480198047]
Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. We experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training.
arXiv Detail & Related papers (2025-05-22T13:59:30Z)
- Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning [48.67380502157004]
Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood.
arXiv Detail & Related papers (2025-05-16T08:50:42Z)
- Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study [45.08382242972142]
Large language models (LLMs) like GPT-4 and LLaMA-3 utilize the powerful in-context learning (ICL) capability of the Transformer architecture to learn on the fly from limited examples.
We present a systematic investigation of transformers' generalization capability with ICL relative to training data coverage.
We find that transformers lack inter-problem generalization with ICL, but excel in intra-task and intra-problem generalization.
arXiv Detail & Related papers (2025-03-19T13:40:45Z)
- Can Transformers Learn Full Bayesian Inference in Context? [13.479322264788367]
We show that transformers can perform full Bayesian inference for commonly used statistical models in context.
We introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows.
Experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods.
arXiv Detail & Related papers (2025-01-28T10:04:53Z)
- Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities.
This work provides a fine-grained mathematical analysis showing how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z)
- Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. We show that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Asymptotic theory of in-context learning by linear attention [33.53106537972063]
In-context learning is a cornerstone of Transformers' success.
Questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. (A minimal linear-attention sketch appears after this list.)
arXiv Detail & Related papers (2024-05-20T03:24:24Z)
- The Transient Nature of Emergent In-Context Learning in Transformers [28.256651019346023]
Transformer networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it.
We show that the emergence of ICL during transformer training is, in fact, often transient.
We find that ICL first emerges, then disappears and gives way to in-weights learning (IWL), all while the training loss decreases.
arXiv Detail & Related papers (2023-11-14T18:03:20Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm.
We prove that the error of the pretrained model is bounded by a sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z)
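Two of the entries above ("Transformers are Minimax Optimal Nonparametric In-Context Learners" and "Asymptotic theory of in-context learning by linear attention") study attention without a softmax. As a reference point, here is a minimal sketch of a single linear-attention layer; the weight shapes and the 1/T scaling are illustrative assumptions, not taken from either paper.

```python
# Minimal linear-attention sketch (illustrative, not either paper's exact model):
# attention weights are the raw dot products Q K^T with no softmax, which is
# what makes the in-context estimator analytically tractable in these analyses.
import numpy as np

def linear_attention(X, Wq, Wk, Wv):
    """X: (T, d) token embeddings; returns (T, d) outputs."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return (Q @ K.T) @ V / X.shape[0]   # 1/T scaling (a common convention)

rng = np.random.default_rng(0)
T, d = 16, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
print(linear_attention(X, Wq, Wk, Wv).shape)   # (16, 8)
```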
This list is automatically generated from the titles and abstracts of the papers on this site.