A Hitchhiker's Guide to Scaling Law Estimation
- URL: http://arxiv.org/abs/2410.11840v1
- Date: Tue, 15 Oct 2024 17:59:10 GMT
- Title: A Hitchhiker's Guide to Scaling Law Estimation
- Authors: Leshem Choshen, Yang Zhang, Jacob Andreas,
- Abstract summary: Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
- Score: 56.06982415792523
- License:
- Abstract: Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that -- all else equal -- estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ scaling behavior, they are often similar enough that a target model's behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
Related papers
- Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference.
For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.
For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z) - Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
arXiv Detail & Related papers (2024-06-06T17:59:09Z) - Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publically available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - More Compute Is What You Need [3.184416958830696]
We propose a new scaling law that suggests model performance depends mostly on the amount of compute spent for transformer-based models.
We predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
arXiv Detail & Related papers (2024-04-30T12:05:48Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Scaling Laws for Acoustic Models [7.906034575114518]
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z) - Scaling Laws for Neural Language Models [14.472857826717613]
We study scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.
arXiv Detail & Related papers (2020-01-23T03:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.