Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models
- URL: http://arxiv.org/abs/2509.15152v1
- Date: Thu, 18 Sep 2025 16:57:27 GMT
- Title: Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models
- Authors: Samet Demir, Zafer Dogan,
- Abstract summary: We study the in-context learning capabilities of pretrained Transformers in the setting of nonlinear regression.<n>We show that the random Transformer behaves equivalent to a finite-degree Hermite model in terms of ICL error.
- Score: 8.135786025034626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the in-context learning (ICL) capabilities of pretrained Transformers in the setting of nonlinear regression. Specifically, we focus on a random Transformer with a nonlinear MLP head where the first layer is randomly initialized and fixed while the second layer is trained. Furthermore, we consider an asymptotic regime where the context length, input dimension, hidden dimension, number of training tasks, and number of training samples jointly grow. In this setting, we show that the random Transformer behaves equivalent to a finite-degree Hermite polynomial model in terms of ICL error. This equivalence is validated through simulations across varying activation functions, context lengths, hidden layer widths (revealing a double-descent phenomenon), and regularization settings. Our results offer theoretical and empirical insights into when and how MLP layers enhance ICL, and how nonlinearity and over-parameterization influence model performance.
Related papers
- How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs [8.135786025034626]
Pretrained Transformers demonstrate remarkable in-context learning capabilities, enabling them to adapt to new tasks.<n>We study ICL in pretrained Transformers with nonlinear equivalence heads on nonlinear tasks drawn from multiple data sources.<n>Our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into universality of architecture and data in ICL.
arXiv Detail & Related papers (2025-10-29T17:51:57Z) - Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions [32.71332125930795]
Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL)<n>We investigate Markovian function learning through a structured ICL setup to reveal underlying optimization behaviors.
arXiv Detail & Related papers (2025-10-21T13:42:48Z) - Provable In-Context Learning of Nonlinear Regression with Transformers [58.018629320233174]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters.<n>Recent research has actively explored the training dynamics behind ICL.<n>This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - Exact Learning Dynamics of In-Context Learning in Linear Transformers and Its Application to Non-Linear Transformers [1.7034813545878589]
Transformer models exhibit remarkable in-context learning (ICL)<n>Our work offers an exact dynamical model for ICL and theoretically grounded tools for analyzing complex transformer training.
arXiv Detail & Related papers (2025-04-17T13:05:33Z) - Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis [97.54180451650122]
We study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words.
We analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear layer.
We prove a novel property of the gradient flow, termed textitautomatic balancing of gradients, which enables the loss values of different samples to decrease almost at the same rate and further facilitates the proof of near minimum training loss.
arXiv Detail & Related papers (2024-10-12T17:50:58Z) - Asymptotic theory of in-context learning by linear attention [33.53106537972063]
In-context learning is a cornerstone of Transformers' success.<n>Questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved.
arXiv Detail & Related papers (2024-05-20T03:24:24Z) - How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze how the mechanics of how Transformer to achieve ICL contribute to the technical challenges of the training problems in Transformers.
arXiv Detail & Related papers (2024-02-23T21:07:20Z) - How Do Transformers Learn In-Context Beyond Simple Functions? A Case
Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Transformers as Statisticians: Provable In-Context Learning with
In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A emphsingle transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.