Scaling Laws and In-Context Learning: A Unified Theoretical Framework
- URL: http://arxiv.org/abs/2511.06232v1
- Date: Sun, 09 Nov 2025 05:19:14 GMT
- Title: Scaling Laws and In-Context Learning: A Unified Theoretical Framework
- Authors: Sushant Mehta, Ishan Gupta
- Abstract summary: In-context learning (ICL) enables large language models to adapt to new tasks from demonstrations without parameter updates. We present a unified theoretical framework connecting scaling laws to ICL emergence in transformers. We show that ICL performance follows power-law relationships with model depth $L$, width $d$, context length $k$, and training data $D$, with exponents determined by task structure.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) enables large language models to adapt to new tasks from demonstrations without parameter updates. Despite extensive empirical studies, a principled understanding of ICL emergence at scale remains elusive. We present a unified theoretical framework connecting scaling laws to ICL emergence in transformers. Our analysis establishes that ICL performance follows power-law relationships with model depth $L$, width $d$, context length $k$, and training data $D$, with exponents determined by task structure. We show that under specific conditions, transformers implement gradient-based meta-learning in their forward pass, with an effective learning rate $\eta_{\text{eff}} = \Theta(1/\sqrt{Ld})$. We demonstrate sharp phase transitions at critical scales and derive optimal depth-width allocations favoring $L^* \propto N^{2/3}$, $d^* \propto N^{1/3}$ for a fixed parameter budget $N = Ld$. Systematic experiments on synthetic tasks validate our predictions, with measured scaling exponents closely matching theory. This work provides both necessary and sufficient conditions for the emergence of ICL and establishes fundamental computational limits on what transformers can learn in-context.
Related papers
- Provable Adversarial Robustness in In-Context Learning [8.201374511929538]
Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. We introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts.
arXiv Detail & Related papers (2026-02-19T12:37:00Z) - How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we introduce a Scaling Law for the search factor, effectively reducing the search complexity from $O(n^3)$ to O(n·C_D·C_) via predictive modeling. We extend the principles of Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z) - Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction. We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
arXiv Detail & Related papers (2025-09-30T03:56:44Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Bayesian scaling laws for in-context learning [85.34114399339741]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates. We show that ICL approximates a Bayesian learner, which gives rise to a novel Bayesian scaling law for ICL. Our scaling law matches existing scaling laws in accuracy while also offering interpretable terms for task priors, learning efficiency, and per-example probabilities.
arXiv Detail & Related papers (2024-10-21T21:45:22Z) - Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
arXiv Detail & Related papers (2024-10-03T21:21:02Z) - Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z) - Exact Conversion of In-Context Learning to Model Weights in Linearized-Attention Transformers [30.145669421100965]
In-Context Learning is a powerful emergent property of large language models.
We show that, for linearized transformer networks, ICL can be made explicit and permanent through the inclusion of bias terms.
Our algorithm (ICLCA) allows for exact conversion in an inexpensive manner.
arXiv Detail & Related papers (2024-06-05T01:47:40Z) - Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification [7.869708570399577]
We consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$.
Theoretically, we show that the trained Transformer reaches near Bayes-optimum, suggesting the usage of the information of the training distribution.
arXiv Detail & Related papers (2024-05-24T00:08:55Z) - Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize the early stage's training and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.