A Solvable Model of Neural Scaling Laws
- URL: http://arxiv.org/abs/2210.16859v1
- Date: Sun, 30 Oct 2022 15:13:18 GMT
- Title: A Solvable Model of Neural Scaling Laws
- Authors: Alexander Maloney, Daniel A. Roberts, James Sully
- Abstract summary: Large language models with a huge number of parameters, when trained on a near-internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models with a huge number of parameters, when trained on a near-internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss, and how the finite extent of the data's spectral power law causes the model's performance to plateau.
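The setup described in the abstract lends itself to a small numerical sketch. The snippet below is an illustrative assumption on my part, not the paper's exact construction: it draws Gaussian data whose covariance spectrum decays as a power law, passes it through a nonlinear random feature map, fits (nearly) ridgeless linear regression in feature space, and records the test loss as training set size and feature count vary. The function name, the exponent `alpha`, the `tanh` nonlinearity, and all hyperparameters are placeholders chosen for the sketch.
```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_test_loss(T, N, M=512, T_test=2000, alpha=1.5, ridge=1e-6):
    """Test loss of random-feature regression on data with a power-law spectrum.

    T: training set size; N: number of random features; M: ambient data
    dimension; alpha: assumed decay exponent of the data covariance spectrum.
    All values here are illustrative placeholders, not the paper's choices.
    """
    # Gaussian data whose covariance eigenvalues fall off as i^(-alpha).
    scales = np.arange(1, M + 1) ** (-alpha / 2.0)
    X_tr = rng.normal(size=(T, M)) * scales
    X_te = rng.normal(size=(T_test, M)) * scales

    # A fixed linear "teacher" generates the targets.
    w_star = rng.normal(size=M)
    y_tr, y_te = X_tr @ w_star, X_te @ w_star

    # Nonlinear random feature map, shared by train and test (one simple choice).
    W = rng.normal(size=(M, N)) / np.sqrt(M)
    Phi_tr, Phi_te = np.tanh(X_tr @ W), np.tanh(X_te @ W)

    # (Nearly) ridgeless least squares in feature space.
    w_hat = np.linalg.solve(Phi_tr.T @ Phi_tr + ridge * np.eye(N), Phi_tr.T @ y_tr)
    return float(np.mean((Phi_te @ w_hat - y_te) ** 2))

# Sweep training set size at a fixed, larger feature count: the loss should fall
# roughly as a power law in T until N (or the finite spectrum) becomes the bottleneck.
for T in (64, 128, 256, 512, 1024):
    print(T, simulate_test_loss(T, N=2048))
```
Swapping the roles of T and N in the sweep should give mirrored behavior, in line with the symmetry between dataset size and parameter count that the abstract emphasizes.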
Related papers
- Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws.
We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z)
- Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z)
- Neural Scaling Laws From Large-N Field Theory: Solvable Model Beyond the Ridgeless Limit [0.0]
We use large-N field theory methods to solve a model proposed by Maloney, Roberts and Sully.
We uncover a duality transformation at the diagrammatic level that explains the symmetry between model and training data set sizes.
arXiv Detail & Related papers (2024-05-29T18:00:01Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
- Scaling Laws For Dense Retrieval [22.76001461620846]
We investigate whether the performance of dense retrieval models follows the scaling law as other neural models.
Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations.
arXiv Detail & Related papers (2024-03-27T15:27:36Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
- Explaining Neural Scaling Laws [17.115592382420626]
Population loss of trained deep neural networks often follows precise power-law scaling relations.
We propose a theory that explains the origins of and connects these scaling laws.
We identify variance-limited and resolution-limited scaling behavior for both dataset and model size.
arXiv Detail & Related papers (2021-02-12T18:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.