A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Abstract Overview
This paper develops a mathematical framework for foundation models by defining a performance function over data size, model size, and training steps, and interpreting emergent intelligence as the existence of its limit as these quantities grow. The analysis introduces a "limit architecture," an infinite-depth composition of basic functional blocks, and studies its existence using nonlinear Lipschitz operator theory. Within this framework, the paper decomposes performance error into optimization, architecture, and sample components to connect asymptotic behavior with scaling laws. It also provides empirical checks, including comparisons of GPT-1 and GPT-2 style architectures and layerwise analyses of several open-source models, to examine whether the proposed theoretical conditions are reflected in practice.
Novelty
The distinctive contribution is a limit-theoretic formulation of emergent intelligence that ties both emergence and scaling laws to the existence and convergence behavior of an infinite-dimensional limit architecture. The paper further introduces the Lipschitz constant (Lip) of nonlinear operators as its central criterion and establishes necessary and sufficient conditions for such a limit architecture to exist.
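For reference, the quantities involved can be written out as follows. The Lipschitz constant and its submultiplicativity under composition are standard; the symbol T_∞ for the limit architecture and the order in which the blocks T_i are composed are notational assumptions made here, not taken from the paper.

```latex
\[
\operatorname{Lip}(T) \;=\; \sup_{x \neq y} \frac{\lVert T(x) - T(y) \rVert}{\lVert x - y \rVert},
\qquad
\operatorname{Lip}(T \circ S) \;\le\; \operatorname{Lip}(T)\,\operatorname{Lip}(S).
\]
% Assumed notation: the limit architecture as a pointwise limit of finite-depth compositions.
\[
T_\infty(x) \;=\; \lim_{n \to \infty} \bigl(T_n \circ T_{n-1} \circ \cdots \circ T_1\bigr)(x).
\]
```

Under this reading, asking whether the limit architecture exists amounts to asking whether the finite-depth compositions converge, which is where bounds on Lip(T_i) enter.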
Results
The paper proves that, under its assumptions, foundation models exhibit emergent intelligence when the performance limit exists. It also derives scaling laws in which training-step and model-size effects are exponential while data-size effects are power-law, yielding an overall bound of the form β^K + Lip(T)^P + N^{-1/2}, where K is the number of training steps, P the model size, and N the data size. Empirically, the authors report that a 1B GPT-2 style model outperforms a matched GPT-1 style model on their benchmark average (47.57% vs. 34.89%), and they show layerwise evidence consistent with the proposed condensing property in Llama-3.1, Qwen-2, and DeepSeek-MoE models.
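To make the shape of this bound concrete, the following is a minimal Python sketch that evaluates its three terms separately. The default values of `beta` and `lip`, the unit constants in front of each term, and the labels "optimization", "architecture", and "sample" (borrowed from the error decomposition mentioned in the overview) are illustrative assumptions; the paper's actual constants are not given here.

```python
import math


def error_bound_terms(K, P, N, beta=0.9, lip=0.95):
    """Return the three terms of beta**K + lip**P + N**-0.5.

    K: number of training steps, P: model size, N: data size.
    beta and lip are hypothetical values in (0, 1), chosen only for illustration.
    """
    if not (0.0 < beta < 1.0 and 0.0 < lip < 1.0):
        raise ValueError("this sketch assumes beta and lip lie strictly between 0 and 1")
    if K < 0 or P < 0 or N <= 0:
        raise ValueError("K and P must be non-negative and N must be positive")
    optimization = beta ** K        # decays exponentially in training steps
    architecture = lip ** P         # decays exponentially in model size
    sample = 1.0 / math.sqrt(N)     # decays as a power law in data size
    return optimization, architecture, sample


# With moderate K and P, the power-law sample term shrinks far more slowly
# than the exponential terms as all three quantities are scaled up together.
small = error_bound_terms(K=100, P=50, N=10_000)
large = error_bound_terms(K=1_000, P=500, N=1_000_000)
```

Comparing `small` and `large` shows the qualitative point of the bound: multiplying K and P by ten drives the first two terms toward zero very quickly, whereas multiplying N by one hundred reduces the third term only by a factor of ten.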
Key Points
- Emergent intelligence is formalized as the existence of the limit of a performance function as data size, model size, and training steps all tend to infinity.
- The theory identifies necessary and sufficient conditions for the limit architecture to exist: Lip(T_i) ≤ 1 for all sufficiently large i, and convergence of the blocks T_i toward a projection operator with summable deviations.
- Empirical analyses connect the theory to practice by tracking Lipschitz constants to show greater training stability for pre-LayerNorm (GPT-2 style) models and by observing condensing behavior in the Llama-3.1, Qwen-2, and DeepSeek-MoE model families.
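The existence conditions in the key points above can be illustrated with a toy example. The sketch below builds hypothetical blocks T_i = (1 - eps_i) P + eps_i tanh with eps_i = 1/i^2, where P projects onto the first coordinate, so that each block has Lipschitz constant at most 1 and its deviation from P is summable. The block family, the sampling-based estimator, and all function names are assumptions made for illustration; they are not the paper's construction or its measurement procedure.

```python
import math
import random


def project(x):
    """Projection onto the first coordinate axis (idempotent, Lipschitz constant 1)."""
    return [x[0]] + [0.0] * (len(x) - 1)


def make_block(i):
    """Hypothetical block T_i = (1 - eps_i) * P + eps_i * tanh, with eps_i = 1 / i**2."""
    eps = 1.0 / (i * i)

    def block(x):
        px = project(x)
        return [(1.0 - eps) * p + eps * math.tanh(v) for p, v in zip(px, x)]

    return block


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def estimate_lip(f, dim=4, trials=2000, seed=0):
    """Monte Carlo lower bound on Lip(f): the largest observed ||f(x)-f(y)|| / ||x-y||."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(trials):
        x = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        y = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        denom = norm([a - b for a, b in zip(x, y)])
        if denom > 0.0:
            num = norm([a - b for a, b in zip(f(x), f(y))])
            best = max(best, num / denom)
    return best


def compose_trajectory(x0, depth):
    """Apply T_1, ..., T_depth in order and record the size of each layer's update."""
    x, steps = list(x0), []
    for i in range(1, depth + 1):
        nxt = make_block(i)(x)
        steps.append(norm([a - b for a, b in zip(nxt, x)]))
        x = nxt
    return x, steps


lip_estimates = [estimate_lip(make_block(i)) for i in (1, 2, 5, 20)]
final_state, step_sizes = compose_trajectory([2.0, -1.5, 0.7, 3.0], depth=200)
```

In this toy setting the sampled Lipschitz ratios never exceed 1, the per-layer update sizes in `step_sizes` shrink toward zero with depth, and the output settles onto the range of the projection (all coordinates but the first vanish). Note that `estimate_lip` only yields a lower bound on the true Lipschitz constant, since it samples finitely many input pairs.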