Stabilizing Transformer Training by Preventing Attention Entropy
Collapse
- URL: http://arxiv.org/abs/2303.06296v2
- Date: Tue, 25 Jul 2023 17:42:37 GMT
- Title: Stabilizing Transformer Training by Preventing Attention Entropy
Collapse
- Authors: Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge,
Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind
- Abstract summary: We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
- Score: 56.45313891694746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training stability is of great importance to Transformers. In this work, we
investigate the training dynamics of Transformers by examining the evolution of
the attention layers. In particular, we track the attention entropy for each
attention head during the course of training, which is a proxy for model
sharpness. We identify a common pattern across different architectures and
tasks, where low attention entropy is accompanied by high training instability,
which can take the form of oscillating loss or divergence. We denote the
pathologically low attention entropy, corresponding to highly concentrated
attention scores, as $\textit{entropy collapse}$. As a remedy, we propose
$\sigma$Reparam, a simple and efficient solution where we reparametrize all
linear layers with spectral normalization and an additional learned scalar. We
demonstrate that $\sigma$Reparam successfully prevents entropy collapse in the
attention layers, promoting more stable training. Additionally, we prove a
tight lower bound of the attention entropy, which decreases exponentially fast
with the spectral norm of the attention logits, providing additional motivation
for our approach. We conduct experiments with $\sigma$Reparam on image
classification, image self-supervised learning, machine translation, speech
recognition, and language modeling tasks. We show that $\sigma$Reparam provides
stability and robustness with respect to the choice of hyperparameters, going
so far as enabling training (a) a Vision Transformer to competitive
performance without warmup, weight decay, layer normalization or adaptive
optimizers; (b) deep architectures in machine translation and (c) speech
recognition to competitive performance without warmup and adaptive optimizers.
Code is available at \url{https://github.com/apple/ml-sigma-reparam}.
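To make the reparametrization above concrete, the sketch below shows a minimal PyTorch implementation of a $\sigma$Reparam-style linear layer (effective weight $\hat{W} = (\gamma / \sigma(W)) W$, with $\sigma(W)$ estimated by power iteration) together with a helper that tracks per-head attention entropy, the quantity monitored in the paper. The names SigmaReparamLinear and attention_entropy, the initialization of $\gamma$, and the single power-iteration step are illustrative assumptions, not the official API; the authors' implementation is in the repository linked above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SigmaReparamLinear(nn.Module):
        """Linear layer whose effective weight is (gamma / sigma(W)) * W (sketch)."""

        def __init__(self, in_features, out_features, bias=True, n_power_iterations=1):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(out_features, in_features))
            nn.init.xavier_uniform_(self.weight)
            self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
            self.gamma = nn.Parameter(torch.ones(()))  # learned scalar; init to 1 is an assumption
            # Persistent left-singular-vector estimate for power iteration.
            self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
            self.n_power_iterations = n_power_iterations

        def forward(self, x):
            w = self.weight
            with torch.no_grad():  # power-iteration vectors carry no gradient
                u = self.u
                for _ in range(self.n_power_iterations):
                    v = F.normalize(w.t() @ u, dim=0)
                    u = F.normalize(w @ v, dim=0)
                self.u.copy_(u)
            sigma = torch.dot(u, w @ v)       # spectral-norm estimate; gradient flows through w
            w_hat = (self.gamma / sigma) * w  # the reparametrized weight
            return F.linear(x, w_hat, self.bias)

    def attention_entropy(attn, eps=1e-9):
        """Mean Shannon entropy per attention head.
        attn: post-softmax attention of shape (..., heads, queries, keys).
        Low values indicate highly concentrated attention, i.e. entropy collapse."""
        ent = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy of each query's distribution
        return ent.mean(dim=-1)                          # average over queries -> one value per head

    # Example: swap nn.Linear for SigmaReparamLinear inside attention/MLP blocks.
    layer = SigmaReparamLinear(512, 512)
    y = layer(torch.randn(8, 16, 512))  # (batch, tokens, features)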
Related papers
- Abrupt Learning in Transformers: A Case Study on Matrix Completion [15.210510215283882]
We formulate the low-rank matrix completion problem as a masked language modeling (MLM) task.
We show that it is possible to train a BERT model to solve this task to low error.
We also analyze the training dynamics of individual model components to understand the sudden drop in loss.
arXiv Detail & Related papers (2024-10-29T17:08:06Z)
- Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis [97.54180451650122]
We study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words.
We analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear layer.
We prove a novel property of the gradient flow, termed $\textit{automatic balancing of gradients}$, which enables the loss values of different samples to decrease at almost the same rate and further facilitates the proof of near-minimum training loss.
arXiv Detail & Related papers (2024-10-12T17:50:58Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Theory, Analysis, and Best Practices for Sigmoid Self-Attention [16.73166377436999]
We revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis.
We prove that transformers with sigmoid attention are universal function approximators.
We introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention.
arXiv Detail & Related papers (2024-09-06T17:53:26Z)
- A Primal-Dual Framework for Transformers and Neural Networks [52.814467832108875]
Self-attention is key to the remarkable success of transformers in sequence modeling tasks.
We show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem.
We propose two new attentions: Batch Normalized Attention (Attention-BN) and Attention with Scaled Head (Attention-SH).
arXiv Detail & Related papers (2024-06-19T19:11:22Z)
- LayerCollapse: Adaptive compression of neural networks [13.567747247563108]
Transformer networks outperform prior art in natural language processing and computer vision.
Models contain hundreds of millions of parameters, demanding significant computational resources.
We present LayerCollapse, a novel structured pruning method to reduce the depth of fully connected layers.
arXiv Detail & Related papers (2023-11-29T01:23:41Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications (pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer) demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.