An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws
- URL: http://arxiv.org/abs/2212.01365v2
- Date: Wed, 18 Oct 2023 20:53:04 GMT
- Title: An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws
- Authors: Hong Jun Jeon, Benjamin Van Roy
- Abstract summary: We study the compute-optimal trade-off between model and training data set sizes for large neural networks.
Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla.
- Score: 24.356906682593532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the compute-optimal trade-off between model and training data set
sizes for large neural networks. Our result suggests a linear relation similar
to that supported by the empirical analysis of Chinchilla. While that work
studies transformer-based large language models trained on the MassiveText
corpus (Gopher), as a starting point for the development of a mathematical theory, we
focus on a simpler learning model and data generating process, each based on a
neural network with a sigmoidal output unit and single hidden layer of ReLU
activation units. We introduce general error upper bounds for a class of
algorithms which incrementally update a statistic (for example gradient
descent). For a particular learning model inspired by Barron (1993), we establish
an upper bound on the minimal information-theoretically achievable expected
error as a function of model and data set sizes. We then derive allocations of
computation that minimize this bound. We present empirical results which
suggest that this approximation correctly identifies an asymptotic linear
compute-optimal scaling. This approximation also generates new insights. Among
other things, it suggests that, as the input dimension or latent space
complexity grows, as might be the case for example if a longer history of
tokens is taken as input to a language model, a larger fraction of the compute
budget should be allocated to growing the learning model rather than training
data.
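As a concrete illustration of this trade-off, the sketch below assumes an error bound of the simple form a/p + b/n in the model size p and data set size n, with a compute budget proportional to p * n; these functional forms and the constants a, b are illustrative assumptions, not the bound derived in the paper. Minimizing the assumed bound under the compute constraint gives p and n that both scale as the square root of compute, i.e. a linear relation between the two.

```python
import numpy as np

# Illustrative (assumed) error bound: err(p, n) ~ a/p + b/n for model size p
# and data set size n, with compute C ~ p * n.  This is a stand-in for the
# paper's information-theoretic bound, chosen only to show the mechanics.
a, b = 2.0, 5.0  # hypothetical constants

def optimal_allocation(C):
    """Minimize a/p + b/n subject to p * n = C (solve by substitution)."""
    p_star = np.sqrt(a * C / b)
    n_star = np.sqrt(b * C / a)
    return p_star, n_star

for C in [1e6, 1e8, 1e10]:
    p, n = optimal_allocation(C)
    print(f"C={C:.0e}  p*={p:.3e}  n*={n:.3e}  n*/p*={n / p:.2f}")
# The ratio n*/p* is constant in C: both sizes grow like sqrt(C),
# i.e. the compute-optimal relation between n and p is linear.
```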
Related papers
- Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z)
- The Persian Rug: solving toy models of superposition using large-scale symmetries [0.0]
We present a complete mechanistic description of the algorithm learned by a minimal non-linear sparse data autoencoder in the limit of large input dimension.
Our work contributes to neural network interpretability by introducing techniques for understanding the structure of autoencoders.
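For reference, here is a minimal sketch of the kind of toy setup common in the superposition literature: sparse high-dimensional inputs compressed through a narrow linear bottleneck and reconstructed with a bias and ReLU. The x_hat = ReLU(W^T W x + b) parameterization, the dimensions, and the sparsity level are assumptions for illustration and need not match the paper's exact model; the weights here are random, not trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 64, 16, 1024          # input dim, bottleneck dim, batch size (assumed values)

# Sparse inputs: each coordinate is nonzero with small probability.
p_active = 0.05
X = rng.uniform(size=(n, d)) * (rng.uniform(size=(n, d)) < p_active)

# Toy autoencoder x_hat = ReLU(W^T W x + b): a narrow linear map down to m
# dimensions, its transpose back up, a bias, and a ReLU.  W here is random,
# so this only demonstrates the shapes and the forward pass, not a learned solution.
W = rng.normal(size=(m, d)) / np.sqrt(d)
b = np.zeros(d)

H = X @ W.T                       # (n, m)  compressed representation
X_hat = np.maximum(H @ W + b, 0)  # (n, d)  reconstruction

print("mean squared reconstruction error (random W):", np.mean((X_hat - X) ** 2))
print("baseline error of predicting all zeros      :", np.mean(X ** 2))
```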
arXiv Detail & Related papers (2024-10-15T22:52:45Z)
- Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws.
We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
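A minimal sketch in this spirit, with the teacher, feature map, sizes, and step size all chosen arbitrarily for illustration: a fixed random-feature model whose linear readout is trained by full-batch gradient descent on a finite training set that is reused at every step, while the train/test gap is tracked over time.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_features, n_train, n_test = 20, 200, 100, 2000  # assumed sizes

# Teacher: noisy linear target.
w_teacher = rng.normal(size=d) / np.sqrt(d)
def make_data(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_teacher + 0.1 * rng.normal(size=n)

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# Random feature model: fixed random projection + ReLU, trainable linear readout.
W = rng.normal(size=(d, num_features)) / np.sqrt(d)
features = lambda X: np.maximum(X @ W, 0.0)
F_tr, F_te = features(X_tr), features(X_te)

theta = np.zeros(num_features)
lr = 0.5 / num_features
for step in range(1, 5001):
    # Full-batch gradient descent: the same finite training set is reused at
    # every step, which is the mechanism behind the growing train/test gap.
    theta -= lr * F_tr.T @ (F_tr @ theta - y_tr) / n_train
    if step % 1000 == 0:
        train_mse = np.mean((F_tr @ theta - y_tr) ** 2)
        test_mse = np.mean((F_te @ theta - y_te) ** 2)
        print(f"step {step:5d}  train {train_mse:.4f}  test {test_mse:.4f}  "
              f"gap {test_mse - train_mse:.4f}")
```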
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- What learning algorithm is in-context learning? Investigations with linear models [87.91612418166464]
We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly.
We show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression.
We present preliminary evidence that in-context learners share algorithmic features with these predictors.
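For concreteness, the sketch below computes the reference predictors that, per the paper, a trained in-context learner's output can be compared against on a synthetic linear-regression context. The dimensions, noise level, ridge penalty, and gradient-descent schedule are arbitrary illustrative choices, and no transformer is trained here.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_context = 8, 16  # assumed task dimensions

# Synthetic in-context regression task: context pairs (x_i, y_i) from a noisy
# linear function, plus one query point x_q.  The hypothesis investigated in
# the paper is that a trained in-context learner's prediction at x_q closely
# matches the predictions of standard estimators fit to the context.
w_true = rng.normal(size=d)
X = rng.normal(size=(n_context, d))
y = X @ w_true + 0.05 * rng.normal(size=n_context)
x_q = rng.normal(size=d)

# Exact least-squares predictor (minimum-norm solution via pseudoinverse).
w_ols = np.linalg.pinv(X) @ y

# Ridge regression predictor with a hypothetical regularization strength.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# A fixed budget of gradient descent steps from zero, another reference predictor.
w_gd = np.zeros(d)
for _ in range(200):
    w_gd -= 0.01 * X.T @ (X @ w_gd - y) / n_context

print("least-squares :", x_q @ w_ols)
print("ridge         :", x_q @ w_ridge)
print("grad. descent :", x_q @ w_gd)
print("ground truth  :", x_q @ w_true)
```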
arXiv Detail & Related papers (2022-11-28T18:59:51Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
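The sketch below illustrates the general idea of randomizing the inner sum of a dynamic-programming recursion, here a forward (sum-product) pass over a chain model in which only a uniform subsample of predecessor states is visited at each step. The state-space size, the scores, and the uniform subsampling scheme are illustrative assumptions; the paper's RDP family and its guarantees are more refined than this.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)

def randomized_forward(log_emit, log_trans, k):
    """Forward (sum-product) pass over a chain, subsampling k predecessor
    states per step instead of summing over all of them."""
    T, S = log_emit.shape
    alpha = log_emit[0]                                 # (S,) log forward scores
    for t in range(1, T):
        idx = rng.choice(S, size=k, replace=False)      # sampled predecessor states
        # Uniform-subsampling estimate of the sum over all S predecessors,
        # rescaled by S/k so the estimate is unbiased on the probability scale.
        scores = alpha[idx, None] + log_trans[idx, :]   # (k, S)
        alpha = np.log(S / k) + logsumexp(scores, axis=0) + log_emit[t]
    return logsumexp(alpha)

T, S, k = 50, 2000, 64                                  # chain length, states, sample size
log_emit = 0.1 * rng.normal(size=(T, S))
log_trans = 0.1 * rng.normal(size=(S, S))
print("randomized log-partition estimate:", randomized_forward(log_emit, log_trans, k))
```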
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
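A heavily simplified sketch of this two-module idea, under assumptions that go beyond the abstract: features and data points are connected in a bipartite graph through nonzero observations, one round of mean-aggregation message passing produces an embedding for a feature unseen at training time, and a placeholder backbone consumes the extrapolated representation. All embeddings and weights here are random stand-ins rather than learned parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_old, emb_dim = 200, 10, 8   # data points, original features, embedding size

# Observed data over the original d_old features (binary for simplicity).
X_old = (rng.uniform(size=(n, d_old)) < 0.3).astype(float)

# Placeholder (random) embeddings for data points and the original features.
data_emb = rng.normal(size=(n, emb_dim))
feat_emb_old = rng.normal(size=(d_old, emb_dim))

# A new feature arrives at test time, observed on the same n data points.
x_new = (rng.uniform(size=n) < 0.3).astype(float)

# One round of message passing over the feature-data graph: the new feature's
# embedding is the mean of the embeddings of the data points where it is active.
mask = x_new > 0
feat_emb_new = data_emb[mask].mean(axis=0)

# A placeholder backbone (here a random linear readout) consumes scores built
# from the extrapolated feature embedding alongside the original ones.
all_feat_emb = np.vstack([feat_emb_old, feat_emb_new])  # (d_old + 1, emb_dim)
readout = rng.normal(size=emb_dim)
X_all = np.hstack([X_old, x_new[:, None]])              # (n, d_old + 1)
logits = X_all @ (all_feat_emb @ readout)               # per-example scores
print("logits shape:", logits.shape, " new-feature embedding norm:", np.linalg.norm(feat_emb_new))
```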
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
- Relative gradient optimization of the Jacobian term in unsupervised deep learning [9.385902422987677]
Learning expressive probabilistic models that correctly describe the data is a ubiquitous problem in machine learning.
Deep density models have been widely used for this task, but their maximum likelihood based training requires estimating the log-determinant of the Jacobian.
We propose a new approach for exact training of such neural networks.
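In the simplest single-layer case, the idea can be illustrated as follows: fit an invertible linear map z = W x with a standard normal base density by maximum likelihood, and right-multiply the ordinary gradient by W^T W. The gradient of the log|det W| term, W^{-T}, then becomes W itself, so the update never computes a determinant or an inverse. The single-layer setting, the data, and the step size below are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 4000  # assumed dimensions

# Data with a non-trivial covariance; we fit a single invertible linear layer
# z = W x under a standard normal base density by maximum likelihood.
A = rng.normal(size=(d, d))
X = rng.normal(size=(n, d)) @ A.T

# Relative-gradient update: right-multiplying the ordinary gradient by W^T W
# turns the log|det W| term's gradient (W^{-T}) into W, so no determinant or
# matrix inverse is ever formed.
W = np.eye(d)
lr = 0.05
for _ in range(2000):
    Z = X @ W.T                                  # forward pass, shape (n, d)
    rel_grad = (np.eye(d) - Z.T @ Z / n) @ W     # relative gradient of the mean log-likelihood
    W += lr * rel_grad                           # ascent step

Z = X @ W.T
print("covariance of z (should be close to the identity):")
print(np.round(Z.T @ Z / n, 2))
```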
arXiv Detail & Related papers (2020-06-26T16:41:08Z)
- The Gaussian equivalence of generative models for learning with shallow neural networks [30.47878306277163]
We study the performance of neural networks trained on data drawn from pre-trained generative models.
We provide three strands of rigorous, analytical and numerical evidence corroborating this equivalence.
These results open a viable path to the theoretical study of machine learning models with realistic data.
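As a rough numerical illustration of the idea, with an arbitrary generator, teacher, and ridge classifier that do not come from the paper: a simple model trained on data from a nonlinear generator and on Gaussian surrogate data with matched mean and covariance should end up with comparable test error.

```python
import numpy as np

rng = np.random.default_rng(5)
latent_dim, input_dim, n_train, n_test = 32, 100, 2000, 10000  # assumed sizes

# A fixed "pre-trained" generator: nonlinear map from latent noise to inputs.
A = rng.normal(size=(input_dim, latent_dim)) / np.sqrt(latent_dim)
generate = lambda n: np.tanh(rng.normal(size=(n, latent_dim)) @ A.T)

# Teacher producing labels, shared across both data sources.
theta = rng.normal(size=input_dim) / np.sqrt(input_dim)
label = lambda X: np.sign(X @ theta)

# Gaussian surrogate with the generator's mean and covariance.
X_ref = generate(50_000)
mu, cov = X_ref.mean(axis=0), np.cov(X_ref, rowvar=False)
gaussian = lambda n: rng.multivariate_normal(mu, cov, size=n)

def ridge_test_error(sample):
    """Train a ridge classifier on one data source and report its test error."""
    X_tr, X_te = sample(n_train), sample(n_test)
    y_tr, y_te = label(X_tr), label(X_te)
    lam = 1.0
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(input_dim), X_tr.T @ y_tr)
    return np.mean(np.sign(X_te @ w) != y_te)

print("test error, generator data:", ridge_test_error(generate))
print("test error, Gaussian data :", ridge_test_error(gaussian))
```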
arXiv Detail & Related papers (2020-06-25T21:20:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.