Broken Neural Scaling Laws
- URL: http://arxiv.org/abs/2210.14891v9
- Date: Fri, 24 Mar 2023 17:56:23 GMT
- Title: Broken Neural Scaling Laws
- Authors: Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger
- Abstract summary: The Broken Neural Scaling Law (BNSL) accurately models and extrapolates the scaling behaviors of deep neural networks.
The evaluated set of tasks includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, and out-of-distribution (OOD) generalization, among others.
- Score: 9.020652910657931
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a smoothly broken power law functional form (referred to by us as
a Broken Neural Scaling Law (BNSL)) that accurately models and extrapolates the
scaling behaviors of deep neural networks (i.e. how the evaluation metric of
interest varies as the amount of compute used for training, number of model
parameters, training dataset size, model input size, number of training steps,
or upstream performance varies) for various architectures and for each of
various tasks within a large and diverse set of upstream and downstream tasks,
in zero-shot, prompted, and fine-tuned settings. This set includes large-scale
vision, language, audio, video, diffusion, generative modeling, multimodal
learning, contrastive learning, AI alignment, robotics, out-of-distribution
(OOD) generalization, continual learning, transfer learning, uncertainty
estimation / calibration, out-of-distribution detection, adversarial
robustness, distillation, sparsity, retrieval, quantization, pruning,
molecules, computer programming/coding, math word problems, arithmetic,
unsupervised/self-supervised learning, and reinforcement learning (single agent
and multi-agent). When compared to other functional forms for neural scaling
behavior, this functional form yields extrapolations of scaling behavior that
are considerably more accurate on this set. Moreover, this functional form
accurately models and extrapolates scaling behavior that other functional forms
are incapable of expressing such as the non-monotonic transitions present in
the scaling behavior of phenomena such as double descent and the delayed, sharp
inflection points (often called "emergent phase transitions") present in the
scaling behavior of tasks such as arithmetic. Lastly, we use this functional
form to glean insights about the limit of the predictability of scaling
behavior. Code is available at
https://github.com/ethancaballero/broken_neural_scaling_laws
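The smoothly broken power-law form described above can be sketched numerically. The following is a minimal illustration, assuming the form y = a + b·x^(−c0)·∏ᵢ(1 + (x/dᵢ)^(1/fᵢ))^(−cᵢ·fᵢ); the parameter names follow the paper's notation only loosely, and the values used are illustrative, not fitted to any experiment.

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Sketch of a smoothly broken power law:
        y = a + b * x^(-c0) * prod_i (1 + (x/d_i)^(1/f_i))^(-c_i * f_i)
    `breaks` is a list of (c_i, d_i, f_i) triples, one per break:
    d_i sets where the break occurs, f_i its sharpness, c_i the change in slope.
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# With no breaks, the form reduces to an ordinary power law a + b * x^(-c0).
xs = np.array([1.0, 10.0, 100.0])
plain = bnsl(xs, a=0.1, b=1.0, c0=0.5, breaks=[])
```

On a log-log plot, each break smoothly transitions the curve between two power-law slopes, which is what lets this form capture double descent and sharp inflection points that a single power law cannot express.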
Related papers
- How Feature Learning Can Improve Neural Scaling Laws [86.9540615081759]
We develop a solvable model of neural scaling laws beyond the kernel limit.
We show how performance scales with model size, training time, and the total amount of available data.
arXiv Detail & Related papers (2024-09-26T14:05:32Z)
- Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws.
We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
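The smooth sigmoidal view of emergence mentioned above can be illustrated with a toy curve (the function and its parameters are hypothetical, not taken from the paper): a capability that rises smoothly in log-compute can still look like a sharp jump on a linear axis.

```python
import math

def sigmoid_capability(log_compute, midpoint, steepness):
    """Toy sigmoidal 'emergence': performance rises gradually with
    log-compute rather than discontinuously. Names are illustrative."""
    return 1.0 / (1.0 + math.exp(-steepness * (log_compute - midpoint)))

# Far below the midpoint the metric looks flat (near zero); near the
# midpoint it rises quickly, which can read as a sudden "emergent" jump.
low = sigmoid_capability(10.0, midpoint=20.0, steepness=1.5)
high = sigmoid_capability(22.0, midpoint=20.0, steepness=1.5)
```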
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- An exactly solvable model for emergence and scaling laws in the multitask sparse parity problem [2.598133279943607]
We present a framework where each new ability (a skill) is represented as a basis function.
We find analytic expressions for the emergence of new skills, as well as for scaling laws of the loss with training time, data size, model size, and optimal compute.
Our simple model captures, using a single fit parameter, the sigmoidal emergence of multiple new skills as training time, data size or model size increases in the neural network.
arXiv Detail & Related papers (2024-04-26T17:45:32Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
arXiv Detail & Related papers (2023-08-18T13:20:08Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with huge numbers of parameters, when trained on a near internet-scale number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which power laws arising in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
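The prediction idea in the snippet above can be sketched as a log-log linear fit on small-scale measurements, extrapolated to a larger scale. The loss values below are synthetic and the exponent is illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical losses measured at small model sizes (parameter counts).
sizes = np.array([1e6, 3e6, 1e7, 3e7])
losses = 4.0 * sizes ** (-0.076)  # synthetic data following a pure power law

# A power law L(N) = b * N^(-c) is a straight line in log-log space:
# log L = log b - c * log N, so fit it with ordinary linear regression.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)

# Extrapolate the fitted line to a much larger model.
pred = np.exp(intercept + slope * np.log(1e9))
```

This works only when the scaling behavior really is a single power law over the extrapolated range; the BNSL paper's point is that breaks in the curve make such naive extrapolation fail.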
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
- Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks [62.48782506095565]
We show that due to the greedy nature of learning in deep neural networks, models tend to rely on just one modality while under-fitting the other modalities.
We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning.
arXiv Detail & Related papers (2022-02-10T20:11:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.