Related papers: Unifying Learning Dynamics and Generalization in Transformers Scaling Law

Unifying Learning Dynamics and Generalization in Transformers Scaling Law

URL: http://arxiv.org/abs/2512.22088v1
Date: Fri, 26 Dec 2025 17:20:09 GMT
Title: Unifying Learning Dynamics and Generalization in Transformers Scaling Law
Authors: Chiwun Yang,
Abstract summary: The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources.<n>This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system.<n>Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data.
Score: 1.5229257192293202
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish a theoretical upper bound on excess risk characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost ${\sf C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $Θ(\mathsf{C}^{-1/6})$. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the upper bounds of generalization.

Related papers

A Simplified Analysis of SGD for Linear Regression with Weight Averaging [64.2393952273612]
Recent work bycitetzou 2021benign provides sharp rates for SGD optimization in linear regression using constant learning rate.<n>We provide a simplified analysis recovering the same bias and variance bounds provided incitepzou 2021benign based on simple linear algebra tools.<n>We believe our work makes the analysis of gradient descent on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling.
arXiv Detail & Related papers (2025-06-18T15:10:38Z)
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data.<n>Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
Scaling Law for Stochastic Gradient Descent in Quadratically Parameterized Linear Regression [5.801904710149222]
In machine learning, the scaling law describes how the model performance improves with the model and data size scaling up.<n>This paper studies the scaling law over a linear regression with the model being quadratically parameterized.<n>As a result, in the canonical linear regression, we provide explicit separations for curves between generalization with and without feature learning, and the information-theoretical lower bound that is to parametrization method and the algorithm.
arXiv Detail & Related papers (2025-02-13T09:29:04Z)
Scaling and renormalization in high-dimensional regression [72.59731158970894]
We present a unifying perspective on recent results on ridge regression.<n>We use the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning.<n>Our results extend and provide a unifying perspective on earlier models of scaling laws.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression [53.15502562048627]
Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator. This work delves into a statistical analysis of augmentation-based pretraining.
arXiv Detail & Related papers (2023-06-01T15:18:55Z)
Koopman Kernel Regression [6.116741319526748]
We show that Koopman operator theory offers a beneficial paradigm for characterizing forecasts via linear time-invariant (LTI) ODEs. We derive a universal Koopman-invariant kernel reproducing Hilbert space (RKHS) that solely spans transformations into LTI dynamical systems. Our experiments demonstrate superior forecasting performance compared to Koopman operator and sequential data predictors.
arXiv Detail & Related papers (2023-05-25T16:22:22Z)
Towards Data-Algorithm Dependent Generalization: a Case Study on Overparameterized Linear Regression [19.047997113063147]
We introduce a notion called data-algorithm compatibility, which considers the generalization behavior of the entire data-dependent training trajectory. We perform a data-dependent trajectory analysis and derive a sufficient condition for compatibility in such a setting.
arXiv Detail & Related papers (2022-02-12T12:42:36Z)
Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded on the complexity' of the fractal structure that underlies its generalization measure. We further specialize our results to specific problems (e.g., linear/logistic regression, one hidden/layered neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z)
Out-of-Distribution Generalization in Kernel Regression [21.958028127426196]
We study generalization in kernel regression when the training and test distributions are different. We identify an overlap matrix that quantifies the mismatch between distributions for a given kernel. We develop procedures for optimizing training and test distributions for a given data budget to find best and worst case generalizations under the shift.
arXiv Detail & Related papers (2021-06-04T04:54:25Z)
Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
empirical optimization is central to modern machine learning, but its role in its success is still unclear. We show that it commonly arises in parameters of discrete multiplicative noise due to variance. A detailed analysis is conducted in which we describe on key factors, including recent step size, and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.