Small-scale proxies for large-scale Transformer training instabilities
- URL: http://arxiv.org/abs/2309.14322v2
- Date: Mon, 16 Oct 2023 18:43:25 GMT
- Title: Small-scale proxies for large-scale Transformer training instabilities
- Authors: Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex
Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman
Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee,
Justin Gilmer, Simon Kornblith
- Abstract summary: We seek ways to reproduce and study training stability and instability at smaller scales.
By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates.
We study methods such as warm-up, weight decay, and the $\mu$Param to train small models that achieve similar losses across orders of magnitude of learning rate variation.
- Score: 69.36381318171338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Teams that have trained large Transformer-based models have reported training
instabilities at large scale that did not appear when training with the same
hyperparameters at smaller scales. Although the causes of such instabilities
are of scientific interest, the amount of resources required to reproduce them
has made investigation difficult. In this work, we seek ways to reproduce and
study training stability and instability at smaller scales. First, we focus on
two sources of training instability described in previous work: the growth of
logits in attention layers (Dehghani et al., 2023) and divergence of the output
logits from the log probabilities (Chowdhery et al., 2022). By measuring the
relationship between learning rate and loss across scales, we show that these
instabilities also appear in small models when training at high learning rates,
and that mitigations previously employed at large scales are equally effective
in this regime. This prompts us to investigate the extent to which other known
optimizer and model interventions influence the sensitivity of the final loss
to changes in the learning rate. To this end, we study methods such as warm-up,
weight decay, and the $\mu$Param (Yang et al., 2022), and combine techniques to
train small models that achieve similar losses across orders of magnitude of
learning rate variation. Finally, we conclude our exploration by studying two
cases where instabilities can be predicted before they emerge by examining the
scaling behavior of model activation and gradient norms.
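The abstract names two instability sources without spelling out their mitigations; the mitigations usually paired with them in the cited work are qk-layernorm for the attention-logit growth (Dehghani et al., 2023) and an auxiliary z-loss on the softmax normalizer of the output logits (Chowdhery et al., 2022). Below is a minimal PyTorch sketch of both ideas, an illustrative reimplementation rather than the authors' code: the tensor shapes, the eps value, the omission of learnable LayerNorm scales, and the 1e-4 z-loss coefficient are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def qk_layernorm_attention(q, k, v, eps=1e-6):
    """Scaled dot-product attention with LayerNorm applied to queries and keys.

    q, k, v have shape (batch, heads, seq, head_dim). Normalizing q and k per
    head bounds the attention logits, countering the logit-growth instability
    attributed to Dehghani et al. (2023) in the abstract. Simplification:
    learnable LayerNorm scale/bias parameters are omitted here.
    """
    q = F.layer_norm(q, q.shape[-1:], eps=eps)
    k = F.layer_norm(k, k.shape[-1:], eps=eps)
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(logits, dim=-1) @ v


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * mean(log Z ** 2), where Z is the softmax normalizer.

    Discourages the output logits from drifting away from log-probabilities,
    the divergence attributed to Chowdhery et al. (2022) in the abstract.
    The coefficient is an assumed illustrative value.
    """
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()


if __name__ == "__main__":
    # Toy shapes only; the z-loss is simply added to the usual cross-entropy.
    b, h, s, d, vocab = 2, 4, 8, 16, 32
    q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
    out = qk_layernorm_attention(q, k, v)  # (b, h, s, d)
    logits = torch.randn(b, s, vocab, requires_grad=True)
    targets = torch.randint(vocab, (b, s))
    loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1)) + z_loss(logits)
    loss.backward()
```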
Related papers
- Tending Towards Stability: Convergence Challenges in Small Language Models [3.734405405403176]
Despite their advantages, smaller models frequently underperform compared to their larger counterparts.
This is anecdotally attributed to their reduced representational capacity.
By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.
arXiv Detail & Related papers (2024-10-15T09:57:19Z) - On the Inductive Bias of Stacking Towards Improving Reasoning [50.225873619537765]
We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%.
MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks.
We conjecture the underlying reason for this inductive bias by exploring the connection of stacking to looped models.
arXiv Detail & Related papers (2024-09-27T17:58:21Z) - Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [62.132347451049455]
Scale has become a main ingredient in obtaining strong machine learning models.
In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule.
We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
arXiv Detail & Related papers (2024-05-28T17:33:54Z) - Reusing Pretrained Models by Multi-linear Operators for Efficient
Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z) - What Happens During Finetuning of Vision Transformers: An Invariance
Based Investigation [7.432224771219168]
The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task.
In this work, we examine the relationship between pretrained vision transformers and the corresponding finetuned versions on several benchmark datasets and tasks.
arXiv Detail & Related papers (2023-07-12T08:35:24Z) - The Emergence of Essential Sparsity in Large Pre-trained Models: The
Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers.
We propose the existence of essential sparsity defined with a sharp dropping point beyond which the performance declines much faster.
We also find essential sparsity to hold valid for N:M sparsity patterns as well as on modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z) - Exploring Weight Balancing on Long-Tailed Recognition Problem [32.01426831450348]
Recognition problems in long-tailed data, in which the sample size per class is heavily skewed, have gained importance.
Weight balancing, which combines classical regularization techniques with two-stage training, has been proposed.
We analyze weight balancing by focusing on neural collapse and the cone effect at each training stage.
arXiv Detail & Related papers (2023-05-26T01:45:19Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - An Empirical Investigation of the Role of Pre-training in Lifelong
Learning [21.995593026269578]
We show that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially.
We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima.
arXiv Detail & Related papers (2021-12-16T19:00:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.