Differential learning kinetics govern the transition from memorization to generalization during in-context learning
- URL: http://arxiv.org/abs/2412.00104v2
- Date: Thu, 12 Dec 2024 16:10:51 GMT
- Title: Differential learning kinetics govern the transition from memorization to generalization during in-context learning
- Authors: Alex Nguyen, Gautam Reddy
- Abstract summary: Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates.
Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks.
We show that the sub-circuits that memorize and generalize can be viewed as largely independent.
- Score: 0.5555497750998242
- Abstract: Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks and the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network's limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relative rates at which these sub-circuits learn explains the transition from memorization to generalization, rather than capacity constraints. We uncover a memorization scaling law, which determines the task diversity threshold at which the network generalizes. The theory quantitatively explains a variety of other ICL-related phenomena, including the long-tailed distribution of when ICL is acquired, the bimodal behavior of solutions close to the task diversity threshold, the influence of contextual and data distributional statistics on ICL, and the transient nature of ICL.
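To illustrate the kinetic picture described in the abstract, the sketch below simulates a toy race between a memorizing sub-circuit, whose time to fit the training set grows with the number of pretraining tasks K, and a generalizing sub-circuit, whose learning time is roughly K-independent. The functional forms, constants, and names (c_mem, alpha, t_gen) are assumptions for illustration and are not taken from the paper; the point is only that a crossover in learning times, rather than a capacity limit, can produce a sharp task diversity threshold.

```python
# Toy illustration (not the paper's model or code): a "race" between two
# sub-circuits during training.
# Assumptions, for illustration only:
#   - the memorizing sub-circuit needs time t_mem(K) = c_mem * K**alpha to fit
#     K pretraining tasks (a stand-in for the memorization scaling law);
#   - the generalizing sub-circuit needs a roughly K-independent time t_gen.
# Whichever sub-circuit drives the training loss down first dominates, so the
# crossover K* with t_mem(K*) = t_gen plays the role of a task diversity
# threshold: below it the network memorizes, above it ICL wins.

c_mem, alpha = 10.0, 1.0   # hypothetical constants of the toy scaling law
t_gen = 500.0              # hypothetical, K-independent time to generalize

def t_mem(num_tasks: int) -> float:
    """Toy time for the memorizing sub-circuit to fit num_tasks tasks."""
    return c_mem * num_tasks ** alpha

# Crossover: c_mem * K*^alpha = t_gen  =>  K* = (t_gen / c_mem) ** (1 / alpha)
k_star = (t_gen / c_mem) ** (1.0 / alpha)

for k in (5, 20, 50, 100, 500):
    winner = "memorizes" if t_mem(k) < t_gen else "generalizes (ICL)"
    print(f"K={k:4d}  t_mem={t_mem(k):7.1f}  t_gen={t_gen:6.1f}  -> {winner}")

print(f"toy task diversity threshold K* ~ {k_star:.1f}")
```

In this toy picture, increasing the exponent or prefactor of the memorization law pushes the threshold to lower task diversity, which is the qualitative dependence the memorization scaling law in the abstract is meant to capture.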
Related papers
- Can Transformers Learn Full Bayesian Inference in Context? [13.479322264788367]
We show that transformers can perform full Bayesian inference for commonly used statistical models in context.
We introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows.
Experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods.
arXiv Detail & Related papers (2025-01-28T10:04:53Z) - Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities.
This work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z) - Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z) - In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to the ability of pretrained large language models to learn a new task from a few examples provided during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z) - Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency.
We show that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - Asymptotic theory of in-context learning by linear attention [33.53106537972063]
In-context learning is a cornerstone of Transformers' success.
Questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved.
arXiv Detail & Related papers (2024-05-20T03:24:24Z) - How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze how the mechanics by which Transformers achieve ICL contribute to the technical challenges of analyzing their training.
arXiv Detail & Related papers (2024-02-23T21:07:20Z) - The Transient Nature of Emergent In-Context Learning in Transformers [28.256651019346023]
Transformer networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it.
We show that the emergence of ICL during transformer training is, in fact, often transient.
We find that ICL first emerges, then disappears and gives way to in-weights learning (IWL), all while the training loss decreases.
arXiv Detail & Related papers (2023-11-14T18:03:20Z) - How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps toward understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We theoretically show the existence of transformers of mild depth and size that approximately implement the corresponding in-context learning algorithms.
arXiv Detail & Related papers (2023-10-16T17:40:49Z) - What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm.
We prove that the error of the pretrained model is bounded by a sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.