The Transient Nature of Emergent In-Context Learning in Transformers
- URL: http://arxiv.org/abs/2311.08360v3
- Date: Mon, 11 Dec 2023 21:42:31 GMT
- Title: The Transient Nature of Emergent In-Context Learning in Transformers
- Authors: Aaditya K. Singh, Stephanie C.Y. Chan, Ted Moskovitz, Erin Grant,
Andrew M. Saxe, Felix Hill
- Abstract summary: Transformer networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it.
We show that the emergence of ICL during transformer training is, in fact, often transient.
We find that ICL first emerges, then disappears and gives way to IWL, all while the training loss decreases.
- Score: 28.256651019346023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer neural networks can exhibit a surprising capacity for in-context
learning (ICL) despite not being explicitly trained for it. Prior work has
provided a deeper understanding of how ICL emerges in transformers, e.g.
through the lens of mechanistic interpretability, Bayesian inference, or by
examining the distributional properties of training data. However, in each of
these cases, ICL is treated largely as a persistent phenomenon; namely, once
ICL emerges, it is assumed to persist asymptotically. Here, we show that the
emergence of ICL during transformer training is, in fact, often transient. We
train transformers on synthetic data designed so that both ICL and in-weights
learning (IWL) strategies can lead to correct predictions. We find that ICL
first emerges, then disappears and gives way to IWL, all while the training
loss decreases, indicating an asymptotic preference for IWL. The transient
nature of ICL is observed in transformers across a range of model sizes and
datasets, raising the question of how much to "overtrain" transformers when
seeking compact, cheaper-to-run models. We find that L2 regularization may
offer a path to more persistent ICL that removes the need for early stopping
based on ICL-style validation tasks. Finally, we present initial evidence that
ICL transience may be caused by competition between ICL and IWL circuits.
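To make the experimental setup concrete, below is a minimal sketch of the kind of synthetic data described above, in which every training sequence can be solved either by reading the label off a matching in-context exemplar (ICL) or by recalling a fixed class-to-label mapping stored in the weights (IWL), while separate evaluators isolate each strategy. This is illustrative Python/NumPy only; the class counts, sequence lengths, and function names are assumptions, not the authors' configuration.

```python
# Illustrative sketch (assumed setup, not the authors' code) of synthetic data
# where both in-context learning (ICL) and in-weights learning (IWL) can
# produce the correct prediction, plus evaluators that isolate each strategy.
import numpy as np

rng = np.random.default_rng(0)

NUM_TRAIN_CLASSES = 1600    # classes seen in training; IWL can memorize their labels
NUM_HELDOUT_CLASSES = 100   # novel classes used only by the ICL evaluator
NUM_LABELS = 32             # size of the label vocabulary
CONTEXT_PAIRS = 8           # (exemplar, label) pairs shown before the query

# Fixed class -> label map: the information an in-weights strategy can memorize.
train_class_to_label = rng.integers(NUM_LABELS, size=NUM_TRAIN_CLASSES)

def training_sequence():
    """Training sequence: the query's class also appears in context, so the
    answer is available both from the context (ICL) and from the fixed
    class->label map (IWL)."""
    classes = rng.choice(NUM_TRAIN_CLASSES, size=CONTEXT_PAIRS, replace=False)
    query_class = rng.choice(classes)
    context = list(zip(classes, train_class_to_label[classes]))
    return context, query_class, train_class_to_label[query_class]

def icl_eval_sequence():
    """ICL evaluator: held-out classes with freshly drawn labels, so only the
    in-context pairs (not memorized weights) reveal the answer."""
    classes = NUM_TRAIN_CLASSES + rng.choice(NUM_HELDOUT_CLASSES,
                                             size=CONTEXT_PAIRS, replace=False)
    fresh_labels = rng.integers(NUM_LABELS, size=CONTEXT_PAIRS)
    idx = rng.integers(CONTEXT_PAIRS)
    return list(zip(classes, fresh_labels)), classes[idx], fresh_labels[idx]

def iwl_eval_sequence():
    """IWL evaluator: the query's class does not appear in context, so only a
    memorized class->label mapping can produce the answer."""
    classes = rng.choice(NUM_TRAIN_CLASSES, size=CONTEXT_PAIRS + 1, replace=False)
    context_classes, query_class = classes[:-1], classes[-1]
    context = list(zip(context_classes, train_class_to_label[context_classes]))
    return context, query_class, train_class_to_label[query_class]

if __name__ == "__main__":
    for make in (training_sequence, icl_eval_sequence, iwl_eval_sequence):
        context, query, target = make()
        print(make.__name__, "query:", query, "target:", target)
```

Against data of this kind, the paper's L2-regularization finding corresponds to adding a weight penalty during training (for example, a non-zero weight-decay coefficient in the optimizer); the abstract does not state the strength used.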
Related papers
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
- On the Training Convergence of Transformers for In-Context Classification [20.980349268151546]
This work aims to theoretically study the training dynamics of transformers for in-context classification tasks.
We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate.
arXiv Detail & Related papers (2024-10-15T16:57:14Z)
- Exact Conversion of In-Context Learning to Model Weights in Linearized-Attention Transformers [30.145669421100965]
In-Context Learning is a powerful emergent property of large language models.
We show that, for linearized transformer networks, ICL can be made explicit and permanent through the inclusion of bias terms.
Our algorithm (ICLCA) allows for exact conversion in an inexpensive manner (a toy sketch of the underlying linear-attention identity appears after this list).
arXiv Detail & Related papers (2024-06-05T01:47:40Z)
- How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning capabilities, whereby a pre-trained model can handle new tasks without fine-tuning.
We analyze the mechanics by which Transformers achieve ICL and how these mechanics contribute to the technical challenges of analyzing Transformer training.
arXiv Detail & Related papers (2024-02-23T21:07:20Z)
- Positional Information Matters for Invariant In-Context Learning: A Case Study of Simple Function Classes [39.08988313527199]
In-context learning (ICL) refers to the ability of a model to condition on a few in-context demonstrations to generate the answer for a new query input.
Despite the impressive ICL ability of LLMs, ICL in LLMs is sensitive to input demonstrations and limited to short context lengths.
arXiv Detail & Related papers (2023-11-30T02:26:55Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps toward understanding in-context learning (ICL) in more complex scenarios by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement natural in-context learning algorithms for these problems (computing the representation, then performing linear prediction on top of it) with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm.
We prove that the error of the pretrained model is bounded by a sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTMs.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
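As noted in the linearized-attention entry above, the toy sketch below illustrates the additivity property of softmax-free attention that makes an exact "context into weights" conversion possible: the key-value outer products of the in-context tokens form a fixed matrix that can be folded into the layer, after which the query no longer needs the demonstrations in its input. This is a hedged illustration of the underlying identity only, not the paper's ICLCA algorithm; all shapes and names are assumptions.

```python
# Toy illustration (assumed shapes and names; not the paper's ICLCA algorithm)
# of why in-context information can be converted exactly into weights when the
# attention is linearized (no softmax): the context's key-value outer products
# sum to a fixed matrix that can be folded into the layer.
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 16, 16
n_context, n_query = 8, 4

# Keys/values of the in-context demonstrations and of the new query-time tokens.
K_ctx, V_ctx = rng.normal(size=(n_context, d_k)), rng.normal(size=(n_context, d_v))
K_new, V_new = rng.normal(size=(n_query, d_k)), rng.normal(size=(n_query, d_v))
Q = rng.normal(size=(n_query, d_k))

# Linearized attention over the full sequence (demonstrations + new tokens).
K_full = np.concatenate([K_ctx, K_new])
V_full = np.concatenate([V_ctx, V_new])
out_with_context = Q @ K_full.T @ V_full

# Absorb the demonstrations into a fixed state matrix once, then drop them.
context_state = K_ctx.T @ V_ctx                      # (d_k, d_v), independent of Q
out_absorbed = Q @ (context_state + K_new.T @ V_new)

assert np.allclose(out_with_context, out_absorbed)
print("max difference after folding context into weights:",
      np.abs(out_with_context - out_absorbed).max())
```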
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.