Small-scale proxies for large-scale Transformer training instabilities
- URL: http://arxiv.org/abs/2309.14322v2
- Date: Mon, 16 Oct 2023 18:43:25 GMT
- Title: Small-scale proxies for large-scale Transformer training instabilities
- Authors: Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex
Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman
Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee,
Justin Gilmer, Simon Kornblith
- Abstract summary: We seek ways to reproduce and study training stability and instability at smaller scales.
By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates.
We study methods such as warm-up, weight decay, and the $\mu$Param to train small models that achieve similar losses across orders of magnitude of learning rate variation.
- Score: 69.36381318171338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Teams that have trained large Transformer-based models have reported training
instabilities at large scale that did not appear when training with the same
hyperparameters at smaller scales. Although the causes of such instabilities
are of scientific interest, the amount of resources required to reproduce them
has made investigation difficult. In this work, we seek ways to reproduce and
study training stability and instability at smaller scales. First, we focus on
two sources of training instability described in previous work: the growth of
logits in attention layers (Dehghani et al., 2023) and divergence of the output
logits from the log probabilities (Chowdhery et al., 2022). By measuring the
relationship between learning rate and loss across scales, we show that these
instabilities also appear in small models when training at high learning rates,
and that mitigations previously employed at large scales are equally effective
in this regime. This prompts us to investigate the extent to which other known
optimizer and model interventions influence the sensitivity of the final loss
to changes in the learning rate. To this end, we study methods such as warm-up,
weight decay, and the $\mu$Param (Yang et al., 2022), and combine techniques to
train small models that achieve similar losses across orders of magnitude of
learning rate variation. Finally, we conclude our exploration by studying two
cases where instabilities can be predicted before they emerge by examining the
scaling behavior of model activation and gradient norms.
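The abstract names two instability sources without spelling out their mitigations; the mitigations usually paired with them in the cited work are qk-layernorm for the attention-logit growth (Dehghani et al., 2023) and an auxiliary z-loss on the softmax normalizer of the output logits (Chowdhery et al., 2022). Below is a minimal PyTorch sketch of both ideas, an illustrative reimplementation rather than the authors' code: the tensor shapes, the eps value, the omission of learnable LayerNorm scales, and the 1e-4 z-loss coefficient are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def qk_layernorm_attention(q, k, v, eps=1e-6):
    """Scaled dot-product attention with LayerNorm applied to queries and keys.

    q, k, v have shape (batch, heads, seq, head_dim). Normalizing q and k per
    head bounds the attention logits, countering the logit-growth instability
    attributed to Dehghani et al. (2023) in the abstract. Simplification:
    learnable LayerNorm scale/bias parameters are omitted here.
    """
    q = F.layer_norm(q, q.shape[-1:], eps=eps)
    k = F.layer_norm(k, k.shape[-1:], eps=eps)
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(logits, dim=-1) @ v


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * mean(log Z ** 2), where Z is the softmax normalizer.

    Discourages the output logits from drifting away from log-probabilities,
    the divergence attributed to Chowdhery et al. (2022) in the abstract.
    The coefficient is an assumed illustrative value.
    """
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()


if __name__ == "__main__":
    # Toy shapes only; the z-loss is simply added to the usual cross-entropy.
    b, h, s, d, vocab = 2, 4, 8, 16, 32
    q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
    out = qk_layernorm_attention(q, k, v)  # (b, h, s, d)
    logits = torch.randn(b, s, vocab, requires_grad=True)
    targets = torch.randint(vocab, (b, s))
    loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1)) + z_loss(logits)
    loss.backward()
```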
Related papers
- Tending Towards Stability: Convergence Challenges in Small Language Models [3.734405405403176]
Despite their advantages, smaller models frequently underperform compared to their larger counterparts.
This is anecdotally attributed to their reduced representational capacity.
By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.
arXiv Detail & Related papers (2024-10-15T09:57:19Z) - On the Inductive Bias of Stacking Towards Improving Reasoning [50.225873619537765]
We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%.
MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks.
We conjecture the underlying reason for this inductive bias by exploring the connection of stacking to looped models.
arXiv Detail & Related papers (2024-09-27T17:58:21Z) - Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [62.132347451049455]
Scale has become a main ingredient in obtaining strong machine learning models.
In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule.
We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
arXiv Detail & Related papers (2024-05-28T17:33:54Z) - Reusing Pretrained Models by Multi-linear Operators for Efficient
Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z) - What Happens During Finetuning of Vision Transformers: An Invariance
Based Investigation [7.432224771219168]
The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task.
In this work, we examine the relationship between pretrained vision transformers and the corresponding finetuned versions on several benchmark datasets and tasks.
arXiv Detail & Related papers (2023-07-12T08:35:24Z) - The Emergence of Essential Sparsity in Large Pre-trained Models: The
Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity defined with a sharp dropping point beyond which the performance declines much faster.
We also find essential sparsity to hold valid for N:M sparsity patterns as well as on modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z) - Exploring Weight Balancing on Long-Tailed Recognition Problem [32.01426831450348]
Recognition problems in long-tailed data, in which the sample size per class is heavily skewed, have gained importance.
Weight balancing, which combines classical regularization techniques with two-stage training, has been proposed.
We analyze weight balancing by focusing on neural collapse and the cone effect at each training stage.
arXiv Detail & Related papers (2023-05-26T01:45:19Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - An Empirical Investigation of the Role of Pre-training in Lifelong
Learning [21.995593026269578]
We show that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially.
We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima.
arXiv Detail & Related papers (2021-12-16T19:00:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.