Training Trajectories of Language Models Across Scales
- URL: http://arxiv.org/abs/2212.09803v3
- Date: Tue, 30 May 2023 03:33:27 GMT
- Title: Training Trajectories of Language Models Across Scales
- Authors: Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth
Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov
- Abstract summary: Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
- Score: 99.38721327771208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling up language models has led to unprecedented performance gains, but
little is understood about how the training dynamics change as models get
larger. How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors? In this
paper, we analyze the intermediate training checkpoints of differently sized
OPT models (Zhang et al., 2022), ranging from 125M to 175B parameters, on next-token
prediction, sequence-level generation, and downstream tasks. We find that 1) at
a given perplexity and independent of model size, a similar subset of training
tokens sees the most significant reduction in loss, with the rest stagnating or
showing double-descent behavior; 2) early in training, all models learn to
reduce the perplexity of grammatical sequences that contain hallucinations,
with small models halting at this suboptimal distribution and larger ones
eventually learning to assign these sequences lower probabilities; 3)
perplexity is a strong predictor of in-context learning performance on 74
multiple-choice tasks from BIG-Bench, and this holds independent of the model
size. Together, these results show that perplexity is more predictive of model
behaviors than model size or training computation.
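Since the paper's central claim is that perplexity, rather than size or compute, predicts model behavior, a minimal sketch of the metric itself may help: perplexity is the exponential of the average per-token negative log-likelihood (this is the standard definition, not code from the paper).

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a sequence from its per-token natural-log probabilities:
    exp of the average negative log-likelihood."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns each token probability 0.25 has perplexity 4,
# regardless of sequence length.
uniform = [math.log(0.25)] * 10
# perplexity(uniform) == 4.0
```

Lower perplexity means the model spreads less probability mass over wrong continuations, which is why checkpoints of very different sizes can be compared on this single axis.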
Related papers
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
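The extrapolation step behind a scaling law can be sketched as a power-law fit in log-log space. The functional form below, loss = a * N**(-alpha), is a common simplification for illustration and is not necessarily the parameterization used in the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * N**(-alpha) by ordinary least squares on
    (log N, log loss) pairs; returns (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope

# Noise-free synthetic losses following loss = 10 * N**-0.3:
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10 * n ** -0.3 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
# a ~ 10, alpha ~ 0.3 (exact here, since the data are noise-free)
```

Fitting on small, cheap models and evaluating the extrapolation at a larger N is exactly the "predict the target model's loss" workflow the summary describes.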
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization [22.90653167145603]
We introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions.
As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts.
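HyperCloning's exact construction is not reproduced here; the sketch below shows only the general idea of a function-preserving width expansion, using the classic Net2Net-style trick of duplicating inputs and halving the corresponding weights so the layer's output is unchanged. This is an illustration of the concept, not the paper's method.

```python
def expand_linear(W, x):
    """Double the input width of a linear layer while preserving its output.
    W is a list of rows (out x in); x is the input vector. Each input unit is
    duplicated and its weights halved, so W2 @ x2 equals W @ x."""
    W2 = [[w / 2 for w in row] + [w / 2 for w in row] for row in W]
    x2 = x + x
    return W2, x2

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

W = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.0]
W2, x2 = expand_linear(W, x)
# matvec(W2, x2) matches matvec(W, x): the wider layer computes the
# same function, so training can start from the small model's solution.
```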
arXiv Detail & Related papers (2024-09-19T16:50:26Z)
- Effects of Scale on Language Model Robustness [7.725206196110384]
We show that adversarially trained larger models generalize faster and better to modified attacks not seen during training when compared with smaller models.
We also analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others.
arXiv Detail & Related papers (2024-07-25T17:26:41Z)
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
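In its simplest form, the difference-in-differences design borrowed from econometrics reduces to the arithmetic below: the change in the treated group minus the change in a control group. This is a toy sketch of the estimator, not the paper's implementation.

```python
def did_estimate(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences: the outcome change for the treated unit
    minus the outcome change for the control, which nets out trends that
    affect both groups."""
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical log-likelihoods before/after a training step. The trained-on
# instance improves by 2.0, a held-out control by 0.5; the excess 1.5 is
# attributed to memorisation of the trained instance.
effect = did_estimate(-3.0, -1.0, -3.0, -2.5)
# effect == 1.5
```

The control-group subtraction is what makes the estimate causal rather than correlational: general improvement from training on other data is removed.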
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
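The "smooth, sigmoidal behavior" can be sketched as a logistic curve in some capability measure (for example, log-compute), rising from a chance-level floor toward a ceiling. The parameterization below, with slope `k`, midpoint `x0`, and a `floor` for chance accuracy, is illustrative and not the paper's actual fit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predicted_accuracy(log_compute, k, x0, floor=0.25):
    """Task accuracy modeled as a sigmoid in log-compute: starts at the
    chance-level floor, crosses the midpoint at x0, and saturates at 1."""
    return floor + (1.0 - floor) * sigmoid(k * (log_compute - x0))

# Far below the midpoint the curve sits at chance (0.25 for 4-way
# multiple choice); at the midpoint it reaches 0.625; far above, near 1.
```

On such a curve, an apparent "emergence" is just the steep middle section of a smooth function, which is why small-model measurements can still predict it.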
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- Emergent Abilities in Reduced-Scale Generative Language Models [10.51168925267033]
Large language models can solve new tasks without task-specific fine-tuning.
This ability is considered an emergent ability and is primarily seen in large language models with billions of parameters.
This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data.
arXiv Detail & Related papers (2024-04-02T18:00:28Z)
- Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times [15.738530737312335]
Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades.
This paper presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends.
The results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.
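Surprisal here is the standard information-theoretic quantity: the negative base-2 logarithm of the probability a model assigns to a word. A minimal sketch (the definition is standard; the example values are hypothetical):

```python
import math

def surprisal_bits(prob):
    """Surprisal of a word in bits: -log2 of its predicted probability.
    High-probability words carry little surprisal; rare words carry a lot."""
    return -math.log2(prob)

# A word predicted with p = 0.5 has surprisal 1 bit. A model that learns
# superhumanly sharp predictions for rare words assigns them probabilities
# far above human expectation, so their surprisal is too low to track
# human reading times.
```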
arXiv Detail & Related papers (2024-02-03T20:22:54Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.