Compute-Optimal LLMs Provably Generalize Better With Scale
- URL: http://arxiv.org/abs/2504.15208v1
- Date: Mon, 21 Apr 2025 16:26:56 GMT
- Title: Compute-Optimal LLMs Provably Generalize Better With Scale
- Authors: Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, J. Zico Kolter, Andrew Gordon Wilson
- Abstract summary: We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
- Score: 102.29926217670926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
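As a rough illustration of how these three components interact along the compute-optimal frontier, the Python sketch below evaluates a generic Freedman-style bound of the form sqrt(2 * variance * C / D) + c * C / D, where C is the description length of the quantized model (parameters times bitrate) and D is the number of training tokens, with D ~ 20 N as a Chinchilla-style approximation. The functional form, constants, and the assumed decay of the loss variance and bitrate with scale are illustrative assumptions, not the paper's actual bound.
```python
import math

def toy_generalization_gap(n_params, tokens_per_param=20.0,
                           bits_per_param=8.0, loss_variance=1.0, c=1.0):
    """Toy Freedman-style bound: sqrt(2 * var * C / D) + c * C / D,
    with C ~ bits to describe the quantized model and D ~ training tokens.
    Illustrative only; the paper's bound has a different, fully empirical form."""
    D = tokens_per_param * n_params   # Chinchilla-style frontier: tokens scale with params
    C = bits_per_param * n_params     # description length of the quantized model (bits)
    return math.sqrt(2.0 * loss_variance * C / D) + c * C / D

# Hypothetical trend: parameters per token stay fixed (C / D is constant in N),
# so the toy gap shrinks only if the loss variance and bitrate shrink with scale.
for n, var, bits in [(1e8, 1.0, 8.0), (1e9, 0.5, 6.0), (1e10, 0.25, 4.0)]:
    gap = toy_generalization_gap(n, loss_variance=var, bits_per_param=bits)
    print(f"N={n:.0e}  gap<= {gap:.3f}")
```
Because the bits-per-token term stays constant on this frontier, the toy gap decreases only through the assumed reduction in loss variance and quantization bitrate, mirroring the paper's qualitative argument.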
Related papers
- The Hidden Influence of Latent Feature Magnitude When Learning with Imbalanced Data [22.521678971526253]
We show that one of the central causes of impaired generalization when learning with imbalanced data is the inherent manner in which ML models perform inference.
We demonstrate that even with aggressive data augmentation, which generally improves minority class prediction accuracy, parametric ML models still associate a class label with a limited number of feature combinations.
arXiv Detail & Related papers (2024-07-14T11:20:50Z)
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$, where $M$ is the model size, $N$ is the data size, and $a$ is the power-law exponent of the data covariance spectrum.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
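A minimal numerical sketch of this rate (an assumption-laden illustration, not code from either paper): under a toy budget compute = M * N, it searches for the allocation of model size M versus data size N that minimizes $M^{-(a-1)} + N^{-(a-1)/a}$.
```python
import numpy as np

def reducible_error(M, N, a=2.0):
    # Reducible test error rate ~ Theta(M^{-(a-1)} + N^{-(a-1)/a}),
    # with M = model size, N = data size, a = power-law exponent
    # of the data covariance spectrum (a > 1).
    return M ** -(a - 1) + N ** -((a - 1) / a)

def best_split(compute, a=2.0, grid=200):
    # Assume a toy budget compute = M * N and search the allocation
    # that minimizes the error rate over a log-spaced grid.
    M = np.logspace(1, np.log10(compute) - 1, grid)
    N = compute / M
    err = reducible_error(M, N, a)
    i = int(np.argmin(err))
    return M[i], N[i], err[i]

for budget in [1e6, 1e8, 1e10]:
    M, N, e = best_split(budget, a=2.0)
    print(f"compute={budget:.0e}  M*={M:.2e}  N*={N:.2e}  err~{e:.2e}")
```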
arXiv Detail & Related papers (2024-06-12T17:53:29Z)
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
- Non-Vacuous Generalization Bounds for Large Language Models [78.42762571499061]
We provide the first non-vacuous generalization bounds for pretrained large language models.
We show that larger models have better generalization bounds and are more compressible than smaller models.
arXiv Detail & Related papers (2023-12-28T17:58:42Z)
- Two Phases of Scaling Laws for Nearest Neighbor Classifiers [18.93620861346151]
A fast scaling law implies that one can solve machine learning problems by simply boosting the data and the model sizes.
We show that a scaling law can have two phases: in the first phase, the generalization error depends exponentially on the data dimension and decreases fast.
arXiv Detail & Related papers (2023-08-16T09:28:55Z)
- Just a Matter of Scale? Reevaluating Scale Equivariance in Convolutional Neural Networks [3.124871781422893]
Convolutional networks are not equivariant to variations in scale and fail to generalize to objects of different sizes.
We introduce a new family of models that applies many re-scaled kernels with shared weights in parallel and then selects the most appropriate one.
Our experimental results on STIR show that both the existing and proposed approaches can improve generalization across scales compared to standard convolutions.
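A minimal sketch of the general idea described above, assuming re-scaled copies of one shared kernel applied in parallel with a per-pixel magnitude-based selection; this is a hypothetical illustration, not the authors' architecture or the STIR benchmark code.
```python
import numpy as np
from scipy.ndimage import zoom
from scipy.signal import convolve2d

def multi_scale_conv(image, base_kernel, scales=(0.5, 1.0, 2.0)):
    """Apply re-scaled copies of a single shared kernel in parallel and keep,
    per pixel, the response of the scale with the largest magnitude.
    Illustrative sketch only, not the paper's model."""
    responses = []
    for s in scales:
        k = zoom(base_kernel, s, order=1)          # re-scaled shared weights
        k = k / (np.abs(k).sum() + 1e-8)           # keep responses comparable
        responses.append(convolve2d(image, k, mode="same", boundary="symm"))
    stack = np.stack(responses)                    # (num_scales, H, W)
    pick = np.argmax(np.abs(stack), axis=0)        # select a scale per location
    return np.take_along_axis(stack, pick[None], axis=0)[0]

# Tiny usage example on random data with a 3x3 edge-like kernel.
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
kernel = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
print(multi_scale_conv(img, kernel).shape)
```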
arXiv Detail & Related papers (2022-11-18T15:27:05Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Information-Theoretic Generalization Bounds for Iterative Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the generalization error of iterative SSL algorithms using information-theoretic principles.
Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z)