The Quantization Model of Neural Scaling
- URL: http://arxiv.org/abs/2303.13506v3
- Date: Sat, 13 Jan 2024 23:51:39 GMT
- Authors: Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
- Abstract summary: We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and the sudden emergence of new capabilities with scale.
We show that when quanta are learned in order of decreasing use frequency, a power law in use frequencies explains the observed power law scaling of loss.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the Quantization Model of neural scaling laws, explaining both the
observed power law dropoff of loss with model and data size, and also the
sudden emergence of new capabilities with scale. We derive this model from what
we call the Quantization Hypothesis, where network knowledge and skills are
"quantized" into discrete chunks ($\textbf{quanta}$). We show that when quanta
are learned in order of decreasing use frequency, then a power law in use
frequencies explains observed power law scaling of loss. We validate this
prediction on toy datasets, then study how scaling curves decompose for large
language models. Using language model gradients, we automatically decompose
model behavior into a diverse set of skills (quanta). We tentatively find that
the frequency at which these quanta are used in the training distribution
roughly follows a power law corresponding with the empirical scaling exponent
for language models, a prediction of our theory.
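The abstract's central argument can be sketched numerically (a minimal illustration, not the paper's code; the Zipf exponent `alpha`, the truncation `K`, and the assumption that a model of size n has learned exactly the n most frequent quanta are ours): if quanta are used with frequencies following a power law p_k ∝ k^-(alpha+1), and loss is the total frequency of unlearned quanta, then loss falls as roughly n^-alpha.

```python
import numpy as np

# Assumed Zipfian use frequencies over quanta: p_k ∝ k^-(alpha+1).
alpha = 0.5
K = 10_000_000  # truncation of the quanta index (illustrative choice)
p = np.arange(1, K + 1, dtype=float) ** -(alpha + 1)
p /= p.sum()  # normalize to a probability distribution

def loss(n: int) -> float:
    """Total use frequency of the quanta a size-n model has NOT learned,
    assuming it learns quanta in order of decreasing use frequency."""
    return p[n:].sum()

# Empirical log-log slope of loss vs. model size between n = 100 and n = 10_000;
# the model predicts this is close to -alpha.
n1, n2 = 100, 10_000
slope = (np.log(loss(n2)) - np.log(loss(n1))) / (np.log(n2) - np.log(n1))
print(slope)  # close to -alpha
```

The measured slope approaches -alpha exactly in the limit of a large quanta pool; the finite truncation K introduces a small deviation.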
Related papers
- Neural Scaling Laws From Large-N Field Theory: Solvable Model Beyond the Ridgeless Limit [0.0]
We use large-N field theory methods to solve a model proposed by Maloney, Roberts and Sully.
We uncover a duality transformation at the diagrams level which explains the symmetry between model and training data set sizes.
arXiv Detail & Related papers (2024-05-29T18:00:01Z) - Power Failure Cascade Prediction using Graph Neural Networks [4.667031410586657]
We propose a flow-free model that predicts grid states at every generation of a cascade process given an initial contingency and power injection values.
We show that the proposed model reduces the computational time by almost two orders of magnitude.
arXiv Detail & Related papers (2024-04-24T18:45:50Z) - QGen: On the Ability to Generalize in Quantization Aware Training [35.0485699853394]
Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations.
We develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization.
arXiv Detail & Related papers (2024-04-17T21:52:21Z) - Neural Scaling Laws on Graphs [54.435688297561015]
We study neural scaling laws on graphs from both model and data perspectives.
For model scaling, we investigate the phenomenon of scaling law collapse and identify overfitting as the potential reason.
For data scaling, we suggest that the number of graphs is not an effective measure of graph data volume in scaling laws, since the sizes of different graphs are highly irregular.
arXiv Detail & Related papers (2024-02-03T06:17:21Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which power laws occurring in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z) - Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z) - Deducing neighborhoods of classes from a fitted model [68.8204255655161]
In this article a new kind of interpretable machine learning method is presented.
It can help to understand the partitioning of the feature space into predicted classes in a classification model using quantile shifts.
Real data points (or specific points of interest) are used, and the changes in the prediction after slightly raising or lowering specific features are observed.
arXiv Detail & Related papers (2020-09-11T16:35:53Z) - UVeQFed: Universal Vector Quantization for Federated Learning [179.06583469293386]
Federated learning (FL) is an emerging approach to training learning models without requiring users to share their possibly private labeled data.
In FL, each user trains its copy of the learning model locally. The server then collects the individual updates and aggregates them into a global model.
We show that combining universal vector quantization methods with FL yields a decentralized training system in which the compression of the trained models induces only minimal distortion.
arXiv Detail & Related papers (2020-06-05T07:10:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.