Broken Neural Scaling Laws
- URL: http://arxiv.org/abs/2210.14891v9
- Date: Fri, 24 Mar 2023 17:56:23 GMT
- Title: Broken Neural Scaling Laws
- Authors: Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger
- Abstract summary: The Broken Neural Scaling Law (BNSL) accurately models and extrapolates the scaling behaviors of deep neural networks.
The evaluated set of tasks includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, and out-of-distribution (OOD) generalization, among others.
- Score: 9.020652910657931
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a smoothly broken power law functional form (referred to by us as
a Broken Neural Scaling Law (BNSL)) that accurately models and extrapolates the
scaling behaviors of deep neural networks (i.e. how the evaluation metric of
interest varies as the amount of compute used for training, number of model
parameters, training dataset size, model input size, number of training steps,
or upstream performance varies) for various architectures and for each of
various tasks within a large and diverse set of upstream and downstream tasks,
in zero-shot, prompted, and fine-tuned settings. This set includes large-scale
vision, language, audio, video, diffusion, generative modeling, multimodal
learning, contrastive learning, AI alignment, robotics, out-of-distribution
(OOD) generalization, continual learning, transfer learning, uncertainty
estimation / calibration, out-of-distribution detection, adversarial
robustness, distillation, sparsity, retrieval, quantization, pruning,
molecules, computer programming/coding, math word problems, arithmetic,
unsupervised/self-supervised learning, and reinforcement learning (single agent
and multi-agent). When compared to other functional forms for neural scaling
behavior, this functional form yields extrapolations of scaling behavior that
are considerably more accurate on this set. Moreover, this functional form
accurately models and extrapolates scaling behavior that other functional forms
are incapable of expressing such as the non-monotonic transitions present in
the scaling behavior of phenomena such as double descent and the delayed, sharp
inflection points (often called "emergent phase transitions") present in the
scaling behavior of tasks such as arithmetic. Lastly, we use this functional
form to glean insights about the limit of the predictability of scaling
behavior. Code is available at
https://github.com/ethancaballero/broken_neural_scaling_laws
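The smoothly broken power-law form described above can be sketched numerically. The following is a minimal illustration, assuming the form y = a + b·x^(−c0)·∏ᵢ(1 + (x/dᵢ)^(1/fᵢ))^(−cᵢ·fᵢ); the parameter names follow the paper's notation only loosely, and the values used are illustrative, not fitted to any experiment.

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Sketch of a smoothly broken power law:
        y = a + b * x^(-c0) * prod_i (1 + (x/d_i)^(1/f_i))^(-c_i * f_i)
    `breaks` is a list of (c_i, d_i, f_i) triples, one per break:
    d_i sets where the break occurs, f_i its sharpness, c_i the change in slope.
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# With no breaks, the form reduces to an ordinary power law a + b * x^(-c0).
xs = np.array([1.0, 10.0, 100.0])
plain = bnsl(xs, a=0.1, b=1.0, c0=0.5, breaks=[])
```

On a log-log plot, each break smoothly transitions the curve between two power-law slopes, which is what lets this form capture double descent and sharp inflection points that a single power law cannot express.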
Related papers
- How Feature Learning Can Improve Neural Scaling Laws [86.9540615081759]
We develop a solvable model of neural scaling laws beyond the kernel limit.
We show how performance scales with model size, training time, and the total amount of available data.
arXiv Detail & Related papers (2024-09-26T14:05:32Z)
- Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws.
We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
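The smooth sigmoidal view of emergence mentioned above can be illustrated with a toy curve (the function and its parameters are hypothetical, not taken from the paper): a capability that rises smoothly in log-compute can still look like a sharp jump on a linear axis.

```python
import math

def sigmoid_capability(log_compute, midpoint, steepness):
    """Toy sigmoidal 'emergence': performance rises gradually with
    log-compute rather than discontinuously. Names are illustrative."""
    return 1.0 / (1.0 + math.exp(-steepness * (log_compute - midpoint)))

# Far below the midpoint the metric looks flat (near zero); near the
# midpoint it rises quickly, which can read as a sudden "emergent" jump.
low = sigmoid_capability(10.0, midpoint=20.0, steepness=1.5)
high = sigmoid_capability(22.0, midpoint=20.0, steepness=1.5)
```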
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- An exactly solvable model for emergence and scaling laws in the multitask sparse parity problem [2.598133279943607]
We present a framework where each new ability (a skill) is represented as a basis function.
We find analytic expressions for the emergence of new skills, as well as for scaling laws of the loss with training time, data size, model size, and optimal compute.
Our simple model captures, using a single fit parameter, the sigmoidal emergence of multiple new skills as training time, data size or model size increases in the neural network.
arXiv Detail & Related papers (2024-04-26T17:45:32Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
arXiv Detail & Related papers (2023-08-18T13:20:08Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with huge numbers of parameters, when trained on a near internet-scale number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which power laws arising in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
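The prediction idea in the snippet above can be sketched as a log-log linear fit on small-scale measurements, extrapolated to a larger scale. The loss values below are synthetic and the exponent is illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical losses measured at small model sizes (parameter counts).
sizes = np.array([1e6, 3e6, 1e7, 3e7])
losses = 4.0 * sizes ** (-0.076)  # synthetic data following a pure power law

# A power law L(N) = b * N^(-c) is a straight line in log-log space:
# log L = log b - c * log N, so fit it with ordinary linear regression.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)

# Extrapolate the fitted line to a much larger model.
pred = np.exp(intercept + slope * np.log(1e9))
```

This works only when the scaling behavior really is a single power law over the extrapolated range; the BNSL paper's point is that breaks in the curve make such naive extrapolation fail.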
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
- Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks [62.48782506095565]
We show that due to the greedy nature of learning in deep neural networks, models tend to rely on just one modality while under-fitting the other modalities.
We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning.
arXiv Detail & Related papers (2022-02-10T20:11:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.