How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
- URL: http://arxiv.org/abs/2505.15009v3
- Date: Tue, 21 Oct 2025 17:34:30 GMT
- Title: How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
- Authors: Quan Nguyen, Thanh Nguyen-Tang
- Abstract summary: We study the approximation capabilities, convergence speeds and on-convergence behaviors of transformers trained on in-context recall tasks. We show that the trained transformers exhibit out-of-distribution generalization, i.e., generalizing to samples outside of the population distribution.
- Score: 23.759737527800585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the approximation capabilities, convergence speeds and on-convergence behaviors of transformers trained on in-context recall tasks -- tasks which require recognizing the \emph{positional} association between a pair of tokens from in-context examples. Existing theoretical results focus only on the in-context reasoning behavior of transformers after being trained for \emph{one} gradient descent step. It remains unclear what the on-convergence behavior of transformers trained by gradient descent is, and how fast the convergence rate is. In addition, the generalization of transformers in one-step in-context reasoning has not been formally investigated. This work addresses these gaps. We first show that a class of transformers with either linear, ReLU or softmax attention is provably Bayes-optimal for an in-context recall task. When trained with gradient descent, we show via a finite-sample analysis that the expected loss converges at a linear rate to the Bayes risk. Moreover, we show that the trained transformers exhibit out-of-distribution (OOD) generalization, i.e., they generalize to samples outside of the population distribution. Our theoretical findings are further supported by extensive empirical validations, which show that \emph{without} proper parameterization, models with larger expressive power surprisingly \emph{fail} to generalize OOD after being trained by gradient descent.
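To make the setup concrete, below is a minimal, illustrative sketch of an in-context recall task and a one-layer softmax-attention model trained by gradient descent. The vocabulary size, context length, pair-per-position encoding, and optimizer settings are simplifying assumptions for illustration only, not the paper's exact parameterization or construction.

```python
# Illustrative sketch only: vocab size, context length, the pair-per-position
# encoding, and the optimizer settings below are assumptions, not the paper's setup.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, L, N = 16, 8, 512         # vocabulary size, number of in-context pairs, batch size
d = 2 * V                    # each context position holds a (key, value) pair

def sample_recall_batch(n):
    """Each context position carries one (key, value) pair; the query repeats one
    of the keys and the label is the value paired with that key in the context."""
    keys = torch.stack([torch.randperm(V)[:L] for _ in range(n)])   # distinct keys per sequence
    vals = torch.randint(0, V, (n, L))
    idx = torch.randint(0, L, (n,))
    query = keys[torch.arange(n), idx]
    label = vals[torch.arange(n), idx]
    ctx = torch.cat([F.one_hot(keys, V), F.one_hot(vals, V)], dim=-1).float()          # (n, L, 2V)
    qry = torch.cat([F.one_hot(query, V).float(), torch.zeros(n, V)], dim=-1)          # (n, 2V)
    return ctx, qry, label

class OneLayerSoftmaxAttention(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.WQ = torch.nn.Linear(d, d, bias=False)
        self.WK = torch.nn.Linear(d, d, bias=False)
        self.WV = torch.nn.Linear(d, V, bias=False)

    def forward(self, ctx, qry):
        q = self.WQ(qry).unsqueeze(1)                    # (n, 1, d): query attends over context
        k, v = self.WK(ctx), self.WV(ctx)                # (n, L, d), (n, L, V)
        att = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
        return (att @ v).squeeze(1)                      # logits over the vocabulary

model = OneLayerSoftmaxAttention()
opt = torch.optim.SGD(model.parameters(), lr=0.5)
for step in range(1000):
    ctx, qry, label = sample_recall_batch(N)
    loss = F.cross_entropy(model(ctx, qry), label)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final cross-entropy: {loss.item():.3f}")
```

Under this encoding a single softmax-attention layer suffices in principle: the query can attend to the context position whose key block matches it and read the value block out of that position, which is exactly the positional association the recall task asks for.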
Related papers
- Quantitative Bounds for Length Generalization in Transformers [58.175107357008876]
We study the problem of length generalization (LG) in transformers. LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences.
arXiv Detail & Related papers (2025-10-30T21:31:36Z) - Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities [58.742178800799614]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z) - Continuum Transformers Perform In-Context Learning by Operator Gradient Descent [24.369713234908627]
We show that continuum transformers can perform in-context operator learning by performing gradient descent in an operator RKHS. We provide empirical validations of this optimality result and demonstrate that the parameters under which such gradient descent is performed are recovered through the continuum transformer training.
arXiv Detail & Related papers (2025-05-23T12:52:54Z) - Learning and Transferring Sparse Contextual Bigrams with Linear Transformers [47.37256334633102]
We introduce the Sparse Contextual Bigram (SCB) model, where the next token's generation depends on a sparse set of earlier positions determined by the last token.
We analyze the training dynamics and sample complexity of learning SCB using a one-layer linear transformer with a gradient-based algorithm.
We prove that, provided a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage.
arXiv Detail & Related papers (2024-10-30T20:29:10Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of gradient flow on the regression loss despite the non-convexity of the loss landscape.
This is the first theoretical analysis for multi-layer Transformers in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527837]
Chain-of-thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps.
We show that in-context learning without intermediate steps can fail to provide accurate generalization in settings where CoT succeeds.
arXiv Detail & Related papers (2024-10-03T03:12:51Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Non-asymptotic Convergence of Training Transformers for Next-token Prediction [48.9399496805422]
Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data.
This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer.
We show that the trained transformer presents next-token prediction ability even under dataset shift.
arXiv Detail & Related papers (2024-09-25T20:22:06Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several works suggest that transformers learn a mesa-optimizer during autoregressive training.
We show that a stronger assumption related to the moments of the data is a sufficient and necessary condition for the learned mesa-optimizer to perform effectively.
arXiv Detail & Related papers (2024-05-27T05:41:06Z) - On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z) - How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation [7.432224771219168]
The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task.
In this work, we examine the relationship between pretrained vision transformers and the corresponding finetuned versions on several benchmark datasets and tasks.
arXiv Detail & Related papers (2023-07-12T08:35:24Z) - Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL).
We show that when transformers are trained over random instances of linear regression problems, their predictions mimic those of ordinary least squares.
arXiv Detail & Related papers (2023-06-16T15:50:03Z) - Transformers learn to implement preconditioned gradient descent for in-context learning [41.74394657009037]
Several recent works demonstrate that transformers can implement algorithms like gradient descent.
We ask: Can transformers learn to implement such algorithms by training over random problem instances?
For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent.
arXiv Detail & Related papers (2023-06-01T02:35:57Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., learn models by gradient descent in their forward pass (a minimal numerical sketch of this correspondence follows this list).
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
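For a flavor of that last claim, here is a tiny numerical sketch (an illustration under simple assumptions, not the construction from that paper): one gradient descent step from zero on an in-context least-squares loss yields exactly the prediction computed by an unnormalized linear attention head whose query, keys, and values are the query input, the in-context inputs, and the in-context targets, respectively.

```python
# Illustrative sketch: dimensions, sample count, and step size are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 20, 0.1                        # input dim, in-context examples, GD step size

X = rng.normal(size=(n, d))                   # in-context inputs x_i
w_star = rng.normal(size=d)
y = X @ w_star                                # in-context targets y_i
x_q = rng.normal(size=d)                      # query input

# One gradient descent step from w0 = 0 on the in-context loss (1/2n) * sum_i (w . x_i - y_i)^2
w1 = (eta / n) * X.T @ y
gd_prediction = w1 @ x_q

# Unnormalized linear attention: query x_q, keys x_i, values y_i, scaled by eta/n
attn_prediction = (eta / n) * np.sum(y * (X @ x_q))

print(gd_prediction, attn_prediction)         # identical up to floating-point error
```

The equality holds because the one-step GD predictor equals (eta/n) * sum_i y_i (x_i . x_q), which is precisely the weighted sum an unnormalized linear attention head computes over the in-context examples.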