DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD
- URL: http://arxiv.org/abs/2507.17501v1
- Date: Wed, 23 Jul 2025 13:37:23 GMT
- Title: DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD
- Authors: Xianbiao Qi, Marco Chen, Wenjie Xiao, Jiaquan Ye, Yelin He, Chun-Guang Li, Zhouchen Lin
- Abstract summary: We introduce a Deeply Normalized Transformer (DNT) to overcome this limitation, enabling seamless training with vanilla mSGDW. To be specific, in DNT, we strategically integrate normalization techniques at proper positions in the Transformers to effectively modulate the Jacobian matrices of each layer. We provide both theoretical justifications of the normalization technique used in our DNT and extensive empirical evaluation on two popular Transformer architectures.
- Score: 43.19878131775045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have become the de facto backbone of modern deep learning, yet their training typically demands an advanced optimizer with an adaptive learning rate, such as AdamW, rather than momentum SGDW (mSGDW). Previous works show that this is mainly due to a heavy-tailed distribution of the gradients. In this paper, we introduce a Deeply Normalized Transformer (DNT), which is meticulously engineered to overcome this limitation, enabling seamless training with vanilla mSGDW while yielding performance comparable to Transformers trained via AdamW. To be specific, in DNT, we strategically integrate normalization techniques at proper positions in the Transformers to effectively modulate the Jacobian matrices of each layer, balance the influence of weights, activations, and their interactions, and thus concentrate the distributions of gradients. We provide both theoretical justifications of the normalization technique used in our DNT and extensive empirical evaluation on two popular Transformer architectures to validate that: a) DNT outperforms its counterparts (i.e., ViT and GPT), and b) DNT can be effectively trained with vanilla mSGDW.
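The abstract above does not specify where the extra normalizations go, so the following is only a minimal sketch of the general idea: additional LayerNorms around the attention and MLP sub-blocks of a pre-norm Transformer, trained with plain momentum SGD plus weight decay. The module names, placements, and hyper-parameters are illustrative assumptions, not the configuration used in DNT.

```python
# Illustrative sketch only: a "deeply normalized" pre-norm Transformer block
# with additional LayerNorms on the sub-block outputs, trained with momentum
# SGD and weight decay. The actual normalization positions used in DNT are
# defined in the paper; these placements are an assumption.
import torch
import torch.nn as nn


class DeeplyNormalizedBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm_attn_in = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_attn_out = nn.LayerNorm(dim)   # extra normalization (assumed)
        self.norm_mlp_in = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.norm_mlp_out = nn.LayerNorm(dim)    # extra normalization (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm_attn_in(x)
        x = x + self.norm_attn_out(self.attn(h, h, h, need_weights=False)[0])
        x = x + self.norm_mlp_out(self.mlp(self.norm_mlp_in(x)))
        return x


model = nn.Sequential(*[DeeplyNormalizedBlock(256) for _ in range(4)])

# Plain momentum SGD with weight decay. Note that torch.optim.SGD applies
# coupled L2 weight decay, whereas mSGDW in the paper refers to the decoupled
# variant; the hyper-parameters below are placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
```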
Related papers
- Transformers without Normalization [58.778767721826206]
We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models.
arXiv Detail & Related papers (2025-03-13T17:59:06Z)
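Since the entry above states the DyT operation explicitly, an element-wise implementation is straightforward to sketch. Only $\tanh(\alpha x)$ with a learnable $\alpha$ is given in the summary; the per-channel scale and shift below are an assumption mirroring LayerNorm's affine parameters.

```python
# Minimal sketch of Dynamic Tanh (DyT) as a drop-in replacement for a
# normalization layer: DyT(x) = tanh(alpha * x), with alpha learnable.
# The per-channel scale/shift (gamma, beta) is an assumption.
import torch
import torch.nn as nn


class DyT(nn.Module):
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))            # assumed affine scale
        self.beta = nn.Parameter(torch.zeros(dim))            # assumed affine shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


# Usage: swap a LayerNorm for a DyT of the same width.
x = torch.randn(2, 16, 256)
print(DyT(256)(x).shape)  # torch.Size([2, 16, 256])
```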
- Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought [46.71030329872635]
Chain of Thought (CoT) prompting has been shown to significantly improve the performance of large language models (LLMs). We study the training dynamics of transformers over a CoT objective on an in-context weight prediction task for linear regression.
arXiv Detail & Related papers (2025-02-28T16:40:38Z)
- OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization [1.7180235064112577]
We consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model.
arXiv Detail & Related papers (2025-01-30T22:52:40Z)
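The OT-Transformer entry above describes a continuous-time model whose dynamics are parametrized by transformer blocks and regularized using optimal transport. The sketch below assumes a weight-tied block as the velocity field, forward-Euler integration, and a kinetic-energy (transport-cost) penalty, which is a common OT-style regularizer; none of these choices are taken from the paper itself.

```python
# Sketch under assumptions: hidden states evolve as dx/dt = f_theta(x) with a
# weight-tied transformer block as the velocity field, integrated by forward
# Euler, while training adds a transport-cost penalty sum ||dx/dt||^2 * dt.
import torch
import torch.nn as nn


class ContinuousTimeTransformer(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, steps: int = 8, T: float = 1.0):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.steps, self.dt = steps, T / steps

    def forward(self, x: torch.Tensor):
        transport_cost = x.new_zeros(())
        for _ in range(self.steps):
            v = self.block(x)                               # velocity field f_theta(x)
            x = x + self.dt * v                             # forward Euler step
            transport_cost = transport_cost + self.dt * v.pow(2).mean()
        return x, transport_cost


model = ContinuousTimeTransformer(dim=64)
out, reg = model(torch.randn(2, 10, 64))
loss = out.pow(2).mean() + 0.1 * reg   # placeholder task loss + OT-style penalty
loss.backward()
```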
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
- A General and Efficient Training for Transformer via Token Expansion [44.002355107931805]
Vision Transformers (ViTs) typically incur an extremely large training cost.
Existing methods have attempted to accelerate the training of ViTs, yet the acceleration typically comes at the expense of accuracy.
We propose a novel token growth scheme, Token Expansion (termed ToE), to achieve consistent training acceleration for ViTs.
arXiv Detail & Related papers (2024-03-31T12:44:24Z)
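The Token Expansion entry above names ToE only as a token growth scheme, so the following is a generic stand-in: train on a subset of patch tokens early and grow the kept-token count with training progress. The uniform selection rule and linear schedule are assumptions, not ToE's actual criteria.

```python
# Generic token-growth sketch (stand-in for ToE, whose exact expansion
# criteria are defined in the paper): keep only a subset of patch tokens
# early in training and linearly grow the number of kept tokens.
import torch


def expand_tokens(tokens: torch.Tensor, progress: float,
                  min_ratio: float = 0.25) -> torch.Tensor:
    """tokens: [batch, num_tokens, dim]; progress: training fraction in [0, 1]."""
    num_tokens = tokens.shape[1]
    keep_ratio = min_ratio + (1.0 - min_ratio) * progress
    keep = max(1, int(round(keep_ratio * num_tokens)))
    # Uniformly spaced subset; ToE's actual selection rule may differ.
    idx = torch.linspace(0, num_tokens - 1, keep,
                         device=tokens.device).round().long()
    return tokens.index_select(1, idx)


tokens = torch.randn(8, 196, 384)            # e.g. ViT patch tokens
early = expand_tokens(tokens, progress=0.0)  # [8, 49, 384] at the start
late = expand_tokens(tokens, progress=1.0)   # [8, 196, 384] near the end
```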
- 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
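The Admin entry above does not describe the mechanism; the sketch below illustrates one common way to stabilize early training in the spirit of Admin, namely rescaling the residual shortcut with a per-channel parameter before the post-LayerNorm. The initialization value is a placeholder; Admin's variance-based initialization rule is given in the cited paper.

```python
# Sketch in the spirit of Admin: weight the residual shortcut with a
# per-channel parameter omega before the post-LayerNorm, so that early in
# training the shortcut dominates and the sub-layer's contribution stays
# small. omega_init here is a placeholder, not Admin's actual rule.
import torch
import torch.nn as nn


class ScaledResidual(nn.Module):
    def __init__(self, sublayer: nn.Module, dim: int, omega_init: float = 1.0):
        super().__init__()
        self.sublayer = sublayer
        self.omega = nn.Parameter(torch.full((dim,), omega_init))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.omega * x + self.sublayer(x))


block = ScaledResidual(nn.Linear(256, 256), dim=256, omega_init=2.0)
print(block(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```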
This list is automatically generated from the titles and abstracts of the papers on this site.