Related papers: How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs

URL: http://arxiv.org/abs/2510.25753v1
Date: Wed, 29 Oct 2025 17:51:57 GMT
Title: How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs
Authors: Samet Demir, Zafer Dogan
Abstract summary: Pretrained Transformers demonstrate remarkable in-context learning capabilities, enabling them to adapt to new tasks. We study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources. Our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.
Score: 8.135786025034626
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.
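The training scheme named in the abstract (first MLP layer updated by a single gradient step, second layer fully optimized) can be illustrated with a minimal sketch. This is not the authors' code and it omits the Transformer and the in-context setting entirely: it applies the scheme to plain supervised regression. The function names, the tanh activation, the squared loss, the step size, and the ridge penalty used to "fully optimize" the second layer are all assumptions made for illustration.

```python
import numpy as np

def one_step_then_ridge(X, y, width=64, step=1.0, ridge=1e-2, seed=0):
    """Two-layer MLP: one gradient step on layer 1, closed-form ridge fit of layer 2.

    All hyperparameters here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((width, d)) / np.sqrt(d)   # first-layer initialization
    a = rng.standard_normal(width) / np.sqrt(width)    # second-layer initialization

    # Forward pass with a tanh activation (an arbitrary choice for this sketch).
    H = np.tanh(X @ W.T)                # shape (n, width)
    resid = H @ a - y                   # shape (n,)

    # Gradient of (1/2n) * ||H a - y||^2 with respect to W.
    grad_W = ((resid[:, None] * a[None, :]) * (1.0 - H**2)).T @ X / n
    W = W - step * grad_W               # the single first-layer gradient step

    # Second layer "fully optimized": ridge regression on the updated features.
    H = np.tanh(X @ W.T)
    a = np.linalg.solve(H.T @ H / n + ridge * np.eye(width), H.T @ y / n)
    return W, a

def predict(X, W, a):
    """Evaluate the two-layer network on inputs X."""
    return np.tanh(X @ W.T) @ a
```

Calling `one_step_then_ridge(X, y)` on an `(n, d)` input array and a length-`n` target vector returns the updated first-layer weights and the fitted second-layer weights, which `predict` then applies to new inputs.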

Related papers

Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models [8.135786025034626]
We study the in-context learning capabilities of pretrained Transformers in the setting of nonlinear regression. We show that the random Transformer behaves equivalently to a finite-degree Hermite model in terms of ICL error.
arXiv Detail & Related papers (2025-09-18T16:57:27Z)
Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
Re-examining learning linear functions in context [4.126494564662494]
In-context learning (ICL) has emerged as a powerful paradigm for easily adapting Large Language Models (LLMs) to various tasks. We explore a simple model of ICL in a controlled setup with synthetic training data. Our findings challenge the prevailing narrative that transformers adopt algorithmic approaches to learn a linear function in-context.
arXiv Detail & Related papers (2024-11-18T10:58:46Z)
Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples. We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer. We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning. We analyze the mechanics by which Transformers achieve ICL and how these contribute to the technical challenges of the training problems in Transformers.
arXiv Detail & Related papers (2024-02-23T21:07:20Z)
How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations. We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL. We show that transformers can implement a broad class of standard machine learning algorithms in context. A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features. We find new and interesting properties that do not exist in single-task linear regression. Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.

This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.