FuguReport

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

Authors Jun Shu, Junxiong Jia, Deyu Meng, Zongben Xu
Affiliations Xi’an Jiaotong University
Categories Theory / Mathematical Modeling / Mathematical formulation of emergent intelligence, Method / Limit Theory / Limit behavior analysis for foundation models, Evaluation / Scaling Laws / Scaling behavior characterization
License CC BY 4.0

Abstract Overview

This paper develops a mathematical framework for foundation models by defining a performance function over data size, model size, and training steps, and interpreting emergent intelligence as the existence of its limit as these quantities grow. The analysis introduces a "limit architecture," an infinite-depth composition of basic functional blocks, and studies its existence using nonlinear Lipschitz operator theory. Within this framework, the paper decomposes performance error into optimization, architecture, and sample components to connect asymptotic behavior with scaling laws. It also provides empirical checks, including comparisons of GPT-1 and GPT-2 style architectures and layerwise analyses of several open-source models, to examine whether the proposed theoretical conditions are reflected in practice.
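
In symbols, the setup described above might be written as follows. The notation is assumed for illustration only and may differ from the paper's own: F is the performance function, N, P, and K stand for data size, model size, and training steps, and the three ε terms are placeholders for the optimization, architecture, and sample components.

```latex
% Emergent intelligence, read as existence of the performance limit.
F^{*} \;=\; \lim_{N,\,P,\,K \to \infty} F(N, P, K)

% Schematic three-way error decomposition (assumed notation).
\bigl| F(N, P, K) - F^{*} \bigr|
  \;\le\; \varepsilon_{\mathrm{opt}}(K)
  \;+\; \varepsilon_{\mathrm{arch}}(P)
  \;+\; \varepsilon_{\mathrm{sample}}(N)
```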

Novelty

The distinctive contribution is a limit-theoretic formulation of emergent intelligence that ties both emergence and scaling laws to the existence and convergence behavior of an infinite-dimensional limit architecture. The paper further introduces the Lipschitz constant of nonlinear operators as a central criterion, establishing necessary and sufficient conditions for the existence of such a limit architecture.

Results

The paper proves that, under its assumptions, foundation models exhibit emergent intelligence when the performance limit exists. It also derives scaling laws in which the error decays exponentially in training steps and model size but only as a power law in data size, yielding an overall bound of the form β^K + Lip(T)^P + N^{-1/2} (reading K, P, and N as training steps, model size, and data size, respectively). Empirically, the authors report that a 1B GPT-2 style model outperforms a matched GPT-1 style model on their benchmark average (47.57% vs. 34.89%), and they show layerwise evidence consistent with the proposed condensing property in Llama-3.1, Qwen-2, and DeepSeek-MoE models.
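
To get a feel for how the three terms of the bound β^K + Lip(T)^P + N^{-1/2} trade off, here is a minimal Python sketch. It assumes K, P, and N denote training steps, model size, and data size, sets every constant factor to 1, and uses hypothetical rates `beta` and `lip` strictly between 0 and 1. None of these values come from the paper.

```python
def scaling_bound(K, P, N, beta=0.9, lip=0.95):
    """Evaluate beta**K + lip**P + N**-0.5 term by term.

    K, P, N: training steps, model size, data size (assumed mapping).
    beta, lip: hypothetical rates in (0, 1); `lip` stands in for Lip(T).
    All constant factors are set to 1 for illustration.
    """
    if not (0.0 < beta < 1.0 and 0.0 < lip < 1.0):
        raise ValueError("beta and lip must lie strictly between 0 and 1")
    if K < 0 or P < 0 or N <= 0:
        raise ValueError("K and P must be non-negative and N positive")
    terms = {
        "optimization": beta ** K,   # exponential in training steps
        "architecture": lip ** P,    # exponential in model size
        "sample": N ** -0.5,         # power law in data size
    }
    terms["total"] = sum(terms.values())
    return terms


if __name__ == "__main__":
    # Hold K and P fixed and grow the data size only.
    for N in (10**4, 10**6, 10**8):
        print(N, scaling_bound(K=500, P=200, N=N))
```

With these illustrative rates, the two exponential terms shrink far faster than the N^{-1/2} term, so the data term ends up dominating the total once K and P are moderately large.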

Key Points

  1. Emergent intelligence is formalized as the existence of the limit of a performance function as data size, model size, and training steps all tend to infinity.
  2. The theory identifies necessary and sufficient conditions for the existence of the limit architecture: the block Lipschitz constants must eventually satisfy Lip(T_i) ≤ 1, and the blocks must converge to a projection operator with summable deviations.
  3. Empirical analyses connect the theory to practice by tracking Lipschitz constants to show greater training stability for pre-LayerNorm (GPT-2 style) models and by observing condensing behavior in the Llama-3.1, Qwen-2, and DeepSeek-MoE model families.
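
The conditions in point 2 can be illustrated on a toy example. The sketch below is not the paper's construction: it assumes hypothetical blocks T_i = (1 - eps_i) P + eps_i G on R^2, where P is the projection onto the x-axis, G is a 1-Lipschitz nonlinear map, and eps_i = 1/(i+1)^2 is a summable deviation schedule. Each block then has Lip(T_i) ≤ 1 and converges to P, and the depth-n composition settles toward a limit as n grows.

```python
import math
import random


def eps(i):
    """Hypothetical summable deviation schedule: sum of 1/(i+1)^2 is finite."""
    return 1.0 / (i + 1) ** 2


def block(i, v):
    """Toy block T_i = (1 - eps_i) * P + eps_i * G on R^2.

    P(x, y) = (x, 0) is a projection; G(x, y) = (sin y, sin x) is
    1-Lipschitz, so the convex combination satisfies Lip(T_i) <= 1.
    """
    x, y = v
    e = eps(i)
    return ((1.0 - e) * x + e * math.sin(y), e * math.sin(x))


def compose(depth, v):
    """Apply T_1, then T_2, ..., then T_depth to the input v."""
    for i in range(1, depth + 1):
        v = block(i, v)
    return v


def empirical_lipschitz(i, trials=2000, seed=0):
    """Largest observed ratio |T_i(u) - T_i(w)| / |u - w| over random pairs.

    This is only a lower estimate of the true Lipschitz constant.
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        w = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        gap = math.dist(u, w)
        if gap > 0.0:
            worst = max(worst, math.dist(block(i, u), block(i, w)) / gap)
    return worst


if __name__ == "__main__":
    v0 = (1.0, 2.0)
    for depth in (10, 100, 1000):
        print(depth, compose(depth, v0))
    print("sampled Lip(T_5) estimate:", empirical_lipschitz(5))
```

Outputs at increasing depth should move less and less, with the second coordinate shrinking toward zero as the blocks approach the projection. Replacing `eps` with a non-summable schedule is one way to experiment with what happens when the summability condition fails.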

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.