Scaling Laws for Acoustic Models
- URL: http://arxiv.org/abs/2106.09488v1
- Date: Fri, 11 Jun 2021 18:59:24 GMT
- Title: Scaling Laws for Acoustic Models
- Authors: Jasha Droppo and Oguz Elibol
- Abstract summary: Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
- Score: 7.906034575114518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is a recent trend in machine learning to increase model quality by
growing models to sizes previously thought to be unreasonable. Recent work has
shown that autoregressive generative models with cross-entropy objective
functions exhibit smooth power-law relationships, or scaling laws, that predict
model quality from model size, training set size, and the available compute
budget. These scaling laws allow one to choose nearly optimal hyper-parameters
given constraints on available training data, model parameter count, or
training computation budget. In this paper, we demonstrate that acoustic models
trained with an auto-predictive coding loss behave as if they are subject to
similar scaling laws. We extend previous work to jointly predict loss due to
model size, to training set size, and to the inherent "irreducible loss" of the
task. We find that the scaling laws accurately match model performance over two
orders of magnitude in both model size and training set size, and make
predictions about the limits of model performance.
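As a concrete illustration of the joint law described in the abstract, the sketch below fits a loss model of the form L(N, D) = E + A/N^alpha + B/D^beta, where N is model size, D is training-set size, and E is the irreducible loss. This additive parameterization, and every number in the example, is an assumption made for illustration; it is not the paper's exact functional form or its fitted constants.

```python
# Minimal sketch: fit a joint scaling law with an irreducible-loss term and
# extrapolate it. The functional form below is an assumption consistent with
# the abstract, not the parameterization reported in the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, alpha, B, beta):
    """Loss as irreducible loss E plus power-law terms in model size N and data size D."""
    N, D = x
    return E + A / N**alpha + B / D**beta

# Synthetic "measurements" generated from a known law plus noise, purely for
# illustration; in practice these would be validation losses of trained models.
rng = np.random.default_rng(0)
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8] * 3)   # model parameter counts
D = np.repeat([1e8, 1e9, 1e10], 5)            # training-set sizes
true_params = (0.8, 50.0, 0.28, 200.0, 0.30)  # hypothetical E, A, alpha, B, beta
loss = scaling_law((N, D), *true_params) + rng.normal(0.0, 0.005, N.size)

# Fit the five parameters to the observed losses.
popt, _ = curve_fit(scaling_law, (N, D), loss,
                    p0=[1.0, 10.0, 0.3, 10.0, 0.3], maxfev=20000)
E, A, alpha, B, beta = popt
print(f"estimated irreducible loss E ~ {E:.3f}")

# Extrapolate beyond the fitted range, as scaling laws are typically used.
print(f"predicted loss at N=1e9, D=1e11: {scaling_law((1e9, 1e11), *popt):.3f}")
```

Once fitted, the estimate of E bounds achievable performance on the task, and the relative size of the two power-law terms indicates whether extra parameters or extra data is the cheaper route to a target loss.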
Related papers
- Scaling Laws for Pre-training Agents and World Models [22.701210075508147]
The performance of embodied agents has been shown to improve with increases in model parameters, dataset size, and compute.
This paper characterizes the role of scale in these tasks more precisely.
arXiv Detail & Related papers (2024-11-07T04:57:40Z)
- Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference.
For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.
For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- More Compute Is What You Need [3.184416958830696]
We propose a new scaling law suggesting that, for transformer-based models, performance depends mostly on the total amount of compute spent.
We predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
arXiv Detail & Related papers (2024-04-30T12:05:48Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Scaling Laws for Neural Language Models [14.472857826717613]
We study scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power law with model size, dataset size, and the amount of compute used for training; a sketch of these power-law forms follows this list.
arXiv Detail & Related papers (2020-01-23T03:59:20Z)
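For reference, the last related paper above ("Scaling Laws for Neural Language Models") popularized these power laws for language models. The sketch below restates its commonly cited forms, where N is model size, D is dataset size, C is training compute, and N_c, D_c, C_c and the exponents are empirical constants fit in that paper (omitted here).

```latex
% Sketch of the single-factor and joint power-law forms; requires amsmath.
\begin{align}
  L(N)    &= \left(\frac{N_c}{N}\right)^{\alpha_N}, \\
  L(D)    &= \left(\frac{D_c}{D}\right)^{\alpha_D}, \\
  L(C)    &= \left(\frac{C_c}{C}\right)^{\alpha_C}, \\
  L(N, D) &= \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}.
\end{align}
```

Each single-factor law holds when the other factors are not the bottleneck, and the joint form L(N, D) reduces to the single-factor laws in the corresponding limits.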