Understanding the Difficulty of Training Transformers
- URL: http://arxiv.org/abs/2004.08249v3
- Date: Sun, 1 Oct 2023 18:34:20 GMT
- Title: Understanding the Difficulty of Training Transformers
- Authors: Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, Jiawei Han
- Abstract summary: We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash the model's full potential in the late stage.
- Score: 120.99980924577787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have proved effective in many NLP tasks. However, their training
requires non-trivial effort in carefully designing cutting-edge optimizers and
learning rate schedulers (e.g., conventional SGD fails to train Transformers
effectively). Our objective here is to understand $\textit{what
complicates Transformer training}$ from both empirical and theoretical
perspectives. Our analysis reveals that unbalanced gradients are not the root
cause of the instability of training. Instead, we identify an amplification
effect that influences training substantially -- for each layer in a
multi-layer Transformer model, heavy dependency on its residual branch makes
training unstable, since it amplifies small parameter perturbations (e.g.,
parameter updates) and results in significant disturbances in the model output.
Yet we observe that a light dependency limits the model potential and leads to
inferior trained models. Inspired by our analysis, we propose Admin
($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize
training in the early stage and unleash the model's full potential in the late
stage. Extensive experiments show that Admin is more stable, converges faster,
and leads to better performance. Implementations are released at:
https://github.com/LiyuanLucasLiu/Transformer-Clinic.
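To make the amplification argument concrete: in a standard post-layer-norm block the output is LayerNorm(x + f(x)), and when the output leans heavily on the branch f(x), small parameter updates get amplified through depth. Below is a minimal PyTorch sketch of an Admin-flavoured block that rescales the skip connection with a vector omega set by a profiling pass. The class and helper names, and the exact variance-based initialization, are illustrative assumptions; the authors' actual implementation is in the repository linked above.

```python
# Minimal sketch (PyTorch) of an Admin-style rescaled residual connection.
# Illustration only; the official repo above contains the authors' method.
import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    """Wraps a sublayer f and computes LayerNorm(x * omega + f(x)).

    `omega` controls how strongly the block depends on its residual branch;
    setting it from profiled output variances (hypothetical helper below)
    is meant to keep early-stage updates from being amplified through depth.
    """
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.omega = nn.Parameter(torch.ones(d_model))  # rescales the skip path

    def forward(self, x):
        return self.norm(x * self.omega + self.sublayer(x))

@torch.no_grad()
def profile_init(blocks, sample_batch):
    """Hypothetical profiling pass: set each omega from the variance
    accumulated by earlier residual branches, so deeper blocks do not
    rely too heavily on their own branch at initialization."""
    x, acc_var = sample_batch, 0.0
    for blk in blocks:
        out = blk.sublayer(x)
        acc_var += out.var().item()
        blk.omega.fill_(max(acc_var, 1.0) ** 0.5)
        x = blk(x)
```

For example, `profile_init([ScaledResidual(nn.Linear(512, 512), 512) for _ in range(6)], torch.randn(8, 512))` runs the profiling pass on a toy stack before normal training begins.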
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition, based on the signal-to-noise ratio in the data model, that distinguishes the small test error regime from the large test error regime.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation perform impressively when learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several works suggest that transformers learn a mesa-optimizer during autoregressive training.
We show that a stronger assumption related to the moments of the data is the sufficient and necessary condition that the learned mesa-optimizer can perform.
arXiv Detail & Related papers (2024-05-27T05:41:06Z) - Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique (a minimal sketch of the layer-tying idea appears after this list).
In particular, memory consumption during training is up to one order of magnitude lower than with conventional training.
arXiv Detail & Related papers (2024-01-23T14:53:20Z) - BranchNorm: Robustly Scaling Extremely Deep Transformers [55.92852268168816]
BranchNorm dynamically rescales the non-residual branch of the Transformer in accordance with the training period.
Experiment results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance (a minimal sketch of this rescaling idea follows the list).
arXiv Detail & Related papers (2023-05-04T12:46:12Z)
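As a rough illustration of the rescaling idea behind BranchNorm, the sketch below scales the non-residual branch by a factor that grows with the training step. The linear warmup schedule, the warmup length, and the class name are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: scale the non-residual branch by a training-progress-dependent
# factor so early updates through that branch stay small (assumed schedule).
import torch
import torch.nn as nn

class ScheduledBranchScale(nn.Module):
    def __init__(self, sublayer: nn.Module, warmup_steps: int = 4000):
        super().__init__()
        self.sublayer = sublayer
        self.warmup_steps = warmup_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def forward(self, x):
        # Ramp the branch weight from ~0 to 1 over the warmup period.
        alpha = min(self.step.item() / self.warmup_steps, 1.0)
        if self.training:
            self.step += 1
        return x + alpha * self.sublayer(x)
```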
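For the Dynamic Layer Tying paper above, here is a minimal sketch of the weight-sharing idea: several depth positions reuse the same layer module, so only the unique layers contribute trainable parameters. The RL controller that chooses the assignment during training is not reproduced; the fixed alternating assignment below is purely hypothetical.

```python
# Sketch: a depth-6 stack that reuses only `n_unique` distinct layers.
import torch.nn as nn

class TiedEncoder(nn.Module):
    def __init__(self, d_model: int = 512, n_unique: int = 2, depth: int = 6):
        super().__init__()
        self.bank = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_unique)
        )
        # Hypothetical fixed assignment; the paper learns this with RL.
        self.assignment = [i % n_unique for i in range(depth)]

    def forward(self, x):
        for idx in self.assignment:
            x = self.bank[idx](x)  # identical weights reused at several depths
        return x
```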