The Quantization Model of Neural Scaling
- URL: http://arxiv.org/abs/2303.13506v3
- Date: Sat, 13 Jan 2024 23:51:39 GMT
- Authors: Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
- Abstract summary: We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and the sudden emergence of new capabilities with scale.
We show that when quanta are learned in order of decreasing use frequency, a power law in use frequencies explains the observed power law scaling of loss.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the Quantization Model of neural scaling laws, explaining both the
observed power law dropoff of loss with model and data size, and also the
sudden emergence of new capabilities with scale. We derive this model from what
we call the Quantization Hypothesis, where network knowledge and skills are
"quantized" into discrete chunks ($\textbf{quanta}$). We show that when quanta
are learned in order of decreasing use frequency, then a power law in use
frequencies explains observed power law scaling of loss. We validate this
prediction on toy datasets, then study how scaling curves decompose for large
language models. Using language model gradients, we automatically decompose
model behavior into a diverse set of skills (quanta). We tentatively find that
the frequency at which these quanta are used in the training distribution
roughly follows a power law corresponding with the empirical scaling exponent
for language models, a prediction of our theory.
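The abstract's central argument can be sketched numerically (a minimal illustration, not the paper's code; the Zipf exponent `alpha`, the truncation `K`, and the assumption that a model of size n has learned exactly the n most frequent quanta are ours): if quanta are used with frequencies following a power law p_k ∝ k^-(alpha+1), and loss is the total frequency of unlearned quanta, then loss falls as roughly n^-alpha.

```python
import numpy as np

# Assumed Zipfian use frequencies over quanta: p_k ∝ k^-(alpha+1).
alpha = 0.5
K = 10_000_000  # truncation of the quanta index (illustrative choice)
p = np.arange(1, K + 1, dtype=float) ** -(alpha + 1)
p /= p.sum()  # normalize to a probability distribution

def loss(n: int) -> float:
    """Total use frequency of the quanta a size-n model has NOT learned,
    assuming it learns quanta in order of decreasing use frequency."""
    return p[n:].sum()

# Empirical log-log slope of loss vs. model size between n = 100 and n = 10_000;
# the model predicts this is close to -alpha.
n1, n2 = 100, 10_000
slope = (np.log(loss(n2)) - np.log(loss(n1))) / (np.log(n2) - np.log(n1))
print(slope)  # close to -alpha
```

The measured slope approaches -alpha exactly in the limit of a large quanta pool; the finite truncation K introduces a small deviation.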
Related papers
- Neural Scaling Laws From Large-N Field Theory: Solvable Model Beyond the Ridgeless Limit [0.0]
We use large-N field theory methods to solve a model proposed by Maloney, Roberts and Sully.
We uncover a duality transformation at the diagrams level which explains the symmetry between model and training data set sizes.
arXiv Detail & Related papers (2024-05-29T18:00:01Z) - Power Failure Cascade Prediction using Graph Neural Networks [4.667031410586657]
We propose a flow-free model that predicts grid states at every generation of a cascade process given an initial contingency and power injection values.
We show that the proposed model reduces the computational time by almost two orders of magnitude.
arXiv Detail & Related papers (2024-04-24T18:45:50Z) - QGen: On the Ability to Generalize in Quantization Aware Training [35.0485699853394]
Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations.
We develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization.
arXiv Detail & Related papers (2024-04-17T21:52:21Z) - Neural Scaling Laws on Graphs [54.435688297561015]
We study neural scaling laws on graphs from both model and data perspectives.
For model scaling, we investigate the phenomenon of scaling law collapse and identify overfitting as the potential reason.
For data scaling, we suggest that the number of graphs is not an effective measure of graph data volume in scaling laws, since the sizes of different graphs are highly irregular.
arXiv Detail & Related papers (2024-02-03T06:17:21Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which power laws occurring in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z) - Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z) - Deducing neighborhoods of classes from a fitted model [68.8204255655161]
In this article a new kind of interpretable machine learning method is presented.
It can help to understand the partitioning of the feature space into predicted classes in a classification model using quantile shifts.
Real data points (or specific points of interest) are used, and the changes in the prediction after slightly raising or lowering specific features are observed.
arXiv Detail & Related papers (2020-09-11T16:35:53Z) - UVeQFed: Universal Vector Quantization for Federated Learning [179.06583469293386]
Federated learning (FL) is an emerging approach to training learning models without requiring users to share their possibly private labeled data.
In FL, each user trains its copy of the learning model locally. The server then collects the individual updates and aggregates them into a global model.
We show that combining universal vector quantization methods with FL yields a decentralized training system in which the compression of the trained models induces only minimal distortion.
arXiv Detail & Related papers (2020-06-05T07:10:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.