Stabilizing Transformer Training by Preventing Attention Entropy
Collapse
- URL: http://arxiv.org/abs/2303.06296v2
- Date: Tue, 25 Jul 2023 17:42:37 GMT
- Title: Stabilizing Transformer Training by Preventing Attention Entropy
Collapse
- Authors: Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge,
Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind
- Abstract summary: We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
- Score: 56.45313891694746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training stability is of great importance to Transformers. In this work, we
investigate the training dynamics of Transformers by examining the evolution of
the attention layers. In particular, we track the attention entropy for each
attention head during the course of training, which is a proxy for model
sharpness. We identify a common pattern across different architectures and
tasks, where low attention entropy is accompanied by high training instability,
which can take the form of oscillating loss or divergence. We denote the
pathologically low attention entropy, corresponding to highly concentrated
attention scores, as $\textit{entropy collapse}$. As a remedy, we propose
$\sigma$Reparam, a simple and efficient solution where we reparametrize all
linear layers with spectral normalization and an additional learned scalar. We
demonstrate that $\sigma$Reparam successfully prevents entropy collapse in the
attention layers, promoting more stable training. Additionally, we prove a
tight lower bound of the attention entropy, which decreases exponentially fast
with the spectral norm of the attention logits, providing additional motivation
for our approach. We conduct experiments with $\sigma$Reparam on image
classification, image self-supervised learning, machine translation, speech
recognition, and language modeling tasks. We show that $\sigma$Reparam provides
stability and robustness with respect to the choice of hyperparameters, going
so far as enabling training (a) a Vision Transformer to competitive
performance without warmup, weight decay, layer normalization or adaptive
optimizers; (b) deep architectures in machine translation and (c) speech
recognition to competitive performance without warmup and adaptive optimizers.
Code is available at \url{https://github.com/apple/ml-sigma-reparam}.
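To make the reparametrization above concrete, the sketch below shows a minimal PyTorch implementation of a $\sigma$Reparam-style linear layer (effective weight $\hat{W} = (\gamma / \sigma(W)) W$, with $\sigma(W)$ estimated by power iteration) together with a helper that tracks per-head attention entropy, the quantity monitored in the paper. The names SigmaReparamLinear and attention_entropy, the initialization of $\gamma$, and the single power-iteration step are illustrative assumptions, not the official API; the authors' implementation is in the repository linked above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SigmaReparamLinear(nn.Module):
        """Linear layer whose effective weight is (gamma / sigma(W)) * W (sketch)."""

        def __init__(self, in_features, out_features, bias=True, n_power_iterations=1):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(out_features, in_features))
            nn.init.xavier_uniform_(self.weight)
            self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
            self.gamma = nn.Parameter(torch.ones(()))  # learned scalar; init to 1 is an assumption
            # Persistent left-singular-vector estimate for power iteration.
            self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
            self.n_power_iterations = n_power_iterations

        def forward(self, x):
            w = self.weight
            with torch.no_grad():  # power-iteration vectors carry no gradient
                u = self.u
                for _ in range(self.n_power_iterations):
                    v = F.normalize(w.t() @ u, dim=0)
                    u = F.normalize(w @ v, dim=0)
                self.u.copy_(u)
            sigma = torch.dot(u, w @ v)       # spectral-norm estimate; gradient flows through w
            w_hat = (self.gamma / sigma) * w  # the reparametrized weight
            return F.linear(x, w_hat, self.bias)

    def attention_entropy(attn, eps=1e-9):
        """Mean Shannon entropy per attention head.
        attn: post-softmax attention of shape (..., heads, queries, keys).
        Low values indicate highly concentrated attention, i.e. entropy collapse."""
        ent = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy of each query's distribution
        return ent.mean(dim=-1)                          # average over queries -> one value per head

    # Example: swap nn.Linear for SigmaReparamLinear inside attention/MLP blocks.
    layer = SigmaReparamLinear(512, 512)
    y = layer(torch.randn(8, 16, 512))  # (batch, tokens, features)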
Related papers
- Abrupt Learning in Transformers: A Case Study on Matrix Completion [15.210510215283882]
We formulate the low-rank matrix completion problem as a masked language modeling (MLM) task.
We show that it is possible to train a BERT model to solve this task to low error.
We also analyze the training dynamics of individual model components to understand the sudden drop in loss.
arXiv Detail & Related papers (2024-10-29T17:08:06Z)
- Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis [97.54180451650122]
We study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words.
We analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear layer.
We prove a novel property of the gradient flow, termed $\textit{automatic balancing of gradients}$, which enables the loss values of different samples to decrease at almost the same rate and further facilitates the proof of near-minimum training loss.
arXiv Detail & Related papers (2024-10-12T17:50:58Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Theory, Analysis, and Best Practices for Sigmoid Self-Attention [16.73166377436999]
We revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis.
We prove that transformers with sigmoid attention are universal function approximators.
We introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention.
arXiv Detail & Related papers (2024-09-06T17:53:26Z)
- A Primal-Dual Framework for Transformers and Neural Networks [52.814467832108875]
Self-attention is key to the remarkable success of transformers in sequence modeling tasks.
We show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem.
We propose two new attentions: Batch Normalized Attention (Attention-BN) and Attention with Scaled Head (Attention-SH).
arXiv Detail & Related papers (2024-06-19T19:11:22Z)
- LayerCollapse: Adaptive compression of neural networks [13.567747247563108]
Transformer networks outperform prior art in natural language processing and computer vision.
Models contain hundreds of millions of parameters, demanding significant computational resources.
We present LayerCollapse, a novel structured pruning method to reduce the depth of fully connected layers.
arXiv Detail & Related papers (2023-11-29T01:23:41Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications (pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer) demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.