4+3 Phases of Compute-Optimal Neural Scaling Laws
- URL: http://arxiv.org/abs/2405.15074v3
- Date: Fri, 18 Apr 2025 21:56:39 GMT
- Title: 4+3 Phases of Compute-Optimal Neural Scaling Laws
- Authors: Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington
- Abstract summary: We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this model to derive new predictions about the compute-limited, infinite-data scaling law regime.
- Score: 31.72805124311781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.
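To make the setup concrete, here is a minimal numerical sketch of a model in this family: features with a power-law covariance spectrum (data complexity), a power-law target (target complexity), a fixed random embedding into d model parameters, and one-pass SGD on the mean-squared loss, swept over d at a fixed FLOP budget. All sizes, constants, and parameter names (`V`, `alpha`, `beta`, the budget) are illustrative assumptions, not the paper's exact parametrization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative latent dimension and complexities (assumed values, not the paper's).
V, alpha, beta = 2000, 1.2, 1.5
spectrum = np.arange(1, V + 1) ** -alpha  # power-law data covariance (data complexity)
target = np.arange(1, V + 1) ** -beta     # power-law target coefficients (target complexity)

def final_loss(d, steps, lr=0.05):
    """One-pass SGD on the MSE with a d-parameter random-feature model."""
    W = rng.normal(size=(d, V)) / np.sqrt(d)   # fixed random embedding of the features
    theta = np.zeros(d)
    for _ in range(steps):
        x = rng.normal(size=V) * np.sqrt(spectrum)   # fresh sample each step (one pass)
        y = target @ x
        phi = W @ x
        theta -= lr * (theta @ phi - y) * phi        # SGD step on 0.5 * (theta @ phi - y) ** 2
    resid = W.T @ theta - target                     # population risk of the learned predictor
    return float(resid @ (spectrum * resid))

# Compute-optimal sweep: with FLOPs ~ d * steps held fixed, which model size d is best?
budget = 400_000
for d in (25, 50, 100, 200):
    print(d, final_loss(d, steps=budget // d))
```

Sweeping the model-parameter-count at fixed budget is the numerical analogue of reading off the compute-optimal d discussed in the abstract; changing `alpha` and `beta` moves the toy model around the data-complexity/target-complexity plane.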
Related papers
- Complexity Scaling Laws for Neural Models using Combinatorial Optimization [3.4585775092874163]
We develop scaling laws based on problem complexity.
We analyze two fundamental complexity measures: solution space size and representation space size.
We show that optimization promotes smooth cost trends, and therefore meaningful scaling laws can be obtained even in the absence of an interpretable loss.
arXiv Detail & Related papers (2025-06-15T18:20:35Z) - Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
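As a point of reference for the gradient-free setting, here is a minimal sketch of the standard two-point zeroth-order gradient estimator on a toy convex objective; it is not the accelerated, high-order-smoothness scheme of the paper, and the objective, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                        # toy smooth convex objective
    return 0.5 * np.sum(x ** 2)

def zo_grad(x, tau=1e-3):
    """Standard two-point zeroth-order gradient estimate along a random direction."""
    u = rng.normal(size=x.shape)
    u /= np.linalg.norm(u)
    return len(x) * (f(x + tau * u) - f(x - tau * u)) / (2 * tau) * u

x = rng.normal(size=10)
for _ in range(2000):
    x -= 0.05 * zo_grad(x)       # zeroth-order descent step using only function values
print(f(x))                      # should be close to the minimum value 0
```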
arXiv Detail & Related papers (2024-11-21T10:26:17Z) - Geometric Algebra Planes: Convex Implicit Neural Volumes [70.12234371845445]
We show that GA-Planes is equivalent to a sparse low-rank factor plus low-resolution matrix.
We also show that GA-Planes can be adapted for many existing representations.
arXiv Detail & Related papers (2024-11-20T18:21:58Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Stochastic Gradient Descent for Nonparametric Regression [11.24895028006405]
This paper introduces an iterative algorithm for training nonparametric additive models.
We show that the resulting estimator satisfies an oracle inequality that allows for model mis-specification.
arXiv Detail & Related papers (2024-01-01T08:03:52Z) - Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow us to trade off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z) - An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws [24.356906682593532]
We study the compute-optimal trade-off between model and training data set sizes for large neural networks.
Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla.
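For intuition on what a linear relation implies, here is a back-of-the-envelope calculation using the common C ≈ 6·N·D FLOP estimate and a fixed Chinchilla-style tokens-per-parameter ratio; the ratio of 20 and the other constants are assumptions for illustration, not values from this paper. With a fixed ratio, both N and D grow like C^(1/2).

```python
# Chinchilla-style allocation implied by a linear N*-vs-D* relation (assumed constants).
def compute_optimal(C, tokens_per_param=20.0, flops_per_token_param=6.0):
    # C ~ flops_per_token_param * N * D and D = tokens_per_param * N  =>  N ~ C**0.5
    N = (C / (flops_per_token_param * tokens_per_param)) ** 0.5   # parameters
    D = tokens_per_param * N                                      # training tokens
    return N, D

for C in (1e21, 1e22, 1e23):
    N, D = compute_optimal(C)
    print(f"C={C:.0e} FLOPs: N ~ {N:.2e} params, D ~ {D:.2e} tokens")
```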
arXiv Detail & Related papers (2022-12-02T18:46:41Z) - A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
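A quick numerical illustration of that mechanism (not the paper's exact generative model): Gaussian data with a power-law covariance spectrum is passed through a ReLU random-feature map, and the decay exponent of the empirical feature-covariance spectrum is fit from its tail. All sizes and the input decay exponent are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

D, P, n, alpha = 400, 600, 4000, 1.5
spectrum = np.arange(1, D + 1) ** -alpha               # power-law input covariance
X = rng.normal(size=(n, D)) * np.sqrt(spectrum)        # latent data samples
W = rng.normal(size=(D, P)) / np.sqrt(D)               # random projection
F = np.maximum(X @ W, 0.0)                             # ReLU random features

eig = np.linalg.eigvalsh(F.T @ F / n)[::-1]            # feature-covariance spectrum, descending
k = np.arange(5, 200)
slope = np.polyfit(np.log(k), np.log(eig[k]), 1)[0]    # fitted power-law tail exponent
print(f"fitted feature eigenvalue decay exponent ~ {slope:.2f} (input decay = -{alpha})")
```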
arXiv Detail & Related papers (2022-10-30T15:13:18Z) - An Exponentially Increasing Step-size for Parameter Estimation in
Statistical Models [37.63410634069547]
We propose to exponentially increase the step-size of the gradient descent (GD) algorithm.
We then consider using the EGD algorithm for solving parameter estimation under non-regular statistical models.
The total computational complexity of the EGD algorithm is optimal and exponentially cheaper than that of GD for solving parameter estimation in non-regular statistical models.
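A minimal sketch of the idea with assumed constants: on the degenerate one-dimensional objective f(x) = x^4 (a stand-in for the flat curvature of non-regular models), fixed-step GD converges only polynomially, while a geometrically growing step-size restores fast convergence. The learning rate and growth factor below are illustrative, not the schedule analyzed in the paper.

```python
def gd(x0, steps, lr0, growth=1.0):
    """Gradient descent on f(x) = x**4 with a geometrically growing step-size."""
    x, lr = x0, lr0
    for _ in range(steps):
        x -= lr * 4 * x ** 3        # gradient of x**4
        lr *= growth                # growth > 1 gives the exponentially increasing schedule
    return x

x_fixed = gd(1.0, 60, lr0=0.1, growth=1.0)   # standard fixed-step GD
x_egd   = gd(1.0, 60, lr0=0.1, growth=1.5)   # exponentially increasing steps
print(f"fixed-step GD: |x| = {abs(x_fixed):.3e}   growing steps: |x| = {abs(x_egd):.3e}")
```

The growing schedule works here because the gradient of x^4 vanishes cubically near the minimizer, so the effective step lr * 4 * x**2 stays bounded even as lr grows exponentially.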
arXiv Detail & Related papers (2022-05-16T21:36:22Z) - Investigating the Relationship Between Dropout Regularization and Model
Complexity in Neural Networks [0.0]
Dropout regularization serves to reduce variance in deep learning models.
We explore the relationship between the dropout rate and model complexity by training 2,000 neural networks.
We build neural networks that predict the optimal dropout rate given the number of hidden units in each dense layer.
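A heavily scaled-down sketch of such an experiment, using Keras: a handful of widths and dropout rates on synthetic regression data, rather than 2,000 networks; the layer sizes, data, dropout grid, and epoch counts are all assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20)).astype("float32")
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=2000)).astype("float32")

def val_loss(hidden_units, dropout_rate):
    """Validation MSE of a one-hidden-layer network with the given dropout rate."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    hist = model.fit(X, y, validation_split=0.3, epochs=10, verbose=0)
    return hist.history["val_loss"][-1]

# Grid-search the dropout rate separately for each width.
for units in (8, 64, 512):
    rates = np.linspace(0.0, 0.7, 8)
    best = min(rates, key=lambda r: val_loss(units, r))
    print(f"{units} hidden units -> best dropout rate ~ {best:.1f}")
```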
arXiv Detail & Related papers (2021-08-14T23:49:33Z)