HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
- URL: http://arxiv.org/abs/2503.04598v3
- Date: Thu, 22 May 2025 14:53:31 GMT
- Title: HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
- Authors: Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma,
- Abstract summary: We propose a simple yet effective hybrid normalization strategy that integrates the advantages of Pre-Norm and Post-Norm. In experiments on large-scale transformer models, we show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models.
- Score: 25.87557024380553
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the position of layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose $\textbf{HybridNorm}$, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence demonstrating that HybridNorm improves gradient flow and model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
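The hybrid scheme described in the abstract can be sketched in a few lines of NumPy. The sketch below is illustrative, not the authors' implementation: it uses a single attention head, plain LayerNorm, and hypothetical weight names (`Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`). The structural points it shows are the two the abstract names: Q, K, and V are each normalized after projection (QKV normalization), and the FFN sub-block applies normalization after the residual addition (Post-Norm).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize the last dimension to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ v

def hybridnorm_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # Attention sub-block: QKV-Norm -- each of Q, K, V is normalized
    # after its projection, instead of normalizing the block input.
    q = layer_norm(x @ Wq)
    k = layer_norm(x @ Wk)
    v = layer_norm(x @ Wv)
    h = x + attention(q, k, v) @ Wo
    # FFN sub-block: Post-Norm -- normalization is applied after the
    # residual addition, as in the original Transformer.
    ffn = np.maximum(h @ W1, 0.0) @ W2
    return layer_norm(h + ffn)
```

Because the block ends in Post-Norm, every token representation leaving it is exactly normalized, while the attention sub-block keeps an untouched identity path from `x` to `h`.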
Related papers
- SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm [31.43772956034752]
Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture. We propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm.
arXiv Detail & Related papers (2026-02-08T17:17:56Z) - Hybrid Dual-Path Linear Transformations for Efficient Transformer Architectures [0.0]
We introduce the Hybrid Dual-Path Linear (HDPL) operator, which decomposes the affine transformation into two topologically distinct pathways. Experiments on the FineWeb-Edu dataset demonstrate that the HDPL architecture outperforms a standard Llama-style baseline. We discuss how the explicit materialization of a probabilistic latent space within the Transformer backbone serves as a vital architectural affordance.
arXiv Detail & Related papers (2026-02-05T20:16:10Z) - SpanNorm: Reconciling Training Stability and Performance in Deep Transformers [55.100133502295996]
We propose SpanNorm, a novel technique designed to resolve the dilemma by integrating the strengths of both paradigms. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios.
arXiv Detail & Related papers (2026-01-30T05:21:57Z) - Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts [27.8245634187787]
We present HALO, a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
arXiv Detail & Related papers (2026-01-29T18:59:53Z) - A Constrained Optimization Perspective of Unrolled Transformers [77.12297732942095]
We introduce a constrained optimization framework for training transformers that behave like descent algorithms. We observe that constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization.
arXiv Detail & Related papers (2026-01-24T02:12:39Z) - Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded in two simple observations. Operators in hybrid models can be tailored to token-manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous-generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z) - OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization [1.7180235064112577]
We consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model.
arXiv Detail & Related papers (2025-01-30T22:52:40Z) - In-context Learning for Mixture of Linear Regressions: Existence, Generalization and Training Dynamics [34.458004744956334]
We prove that there exists a transformer capable of achieving a prediction error of order $\mathcal{O}(\sqrt{d/n})$ with high probability. We also analyze the training dynamics of transformers with single linear self-attention layers, demonstrating that, with appropriately chosen parameters, gradient flow optimization over the population mean square loss converges to a global optimum.
arXiv Detail & Related papers (2024-10-18T05:28:47Z) - ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models [3.7802450241986945]
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization.
This work explores desirable activation functions in normalization-free decoder-only LLMs.
ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement.
arXiv Detail & Related papers (2024-10-12T20:26:01Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule. We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
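The delta-rule update underlying this line of work is compact enough to state directly. The NumPy sketch below is the sequential reference recurrence only; the cited paper's contribution is computing it efficiently in parallel over the sequence length, which is not shown here. Function and variable names are illustrative.

```python
import numpy as np

def delta_rule_linear_attention(q, k, v, beta):
    # Sequential form of the delta-rule update for linear attention:
    #   S_t = S_{t-1} - beta_t * (S_{t-1} k_t - v_t) k_t^T
    #   o_t = S_t q_t
    # S acts as a fast-weight memory: each step corrects the memory's
    # prediction for key k_t toward value v_t with learning rate beta_t.
    T, d_k = q.shape
    S = np.zeros((v.shape[1], d_k))
    out = np.zeros_like(v)
    for t in range(T):
        err = S @ k[t] - v[t]                   # prediction error at k_t
        S = S - beta[t] * np.outer(err, k[t])   # delta-rule correction
        out[t] = S @ q[t]                       # read with query q_t
    return out
```

With `beta = 1` and orthonormal keys the correction is exact: querying a previously written key retrieves its value.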
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - Linearly-evolved Transformer for Pan-sharpening [34.06189165260206]
Vision transformer family has dominated the satellite pan-sharpening field driven by the global-wise spatial information modeling mechanism.
Standard modeling rules within these promising pan-sharpening methods are to roughly stack the transformer variants in a cascaded manner.
We propose an efficient linearly-evolved transformer variant and employ it to construct a lightweight pan-sharpening framework.
arXiv Detail & Related papers (2024-04-19T11:38:34Z) - Unfolding Once is Enough: A Deployment-Friendly Transformer Unit for Super-Resolution [16.54421804141835]
High resolution of intermediate features in SISR models increases memory and computational requirements.
We propose a Deployment-friendly Inner-patch Transformer Network (DITN) for the SISR task.
Our models can achieve competitive results in terms of qualitative and quantitative performance with high deployment efficiency.
arXiv Detail & Related papers (2023-08-05T05:42:51Z) - BranchNorm: Robustly Scaling Extremely Deep Transformers [55.92852268168816]
BranchNorm dynamically rescales the non-residual branch of Transformer in accordance with the training period.
Experiment results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and converge performance.
arXiv Detail & Related papers (2023-05-04T12:46:12Z) - Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z) - Mnemosyne: Learning to Train Transformers with Transformers [18.36543176998175]
We show that Mnemosyne can successfully train Transformers while using simple meta-training strategies that require minimal computational resources.
Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters.
arXiv Detail & Related papers (2023-02-02T14:40:28Z) - Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens [3.421506449201873]
The Regression Transformer (RT) casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens.
We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation.
This finds application particularly in property-driven, local exploration of the chemical or protein space.
arXiv Detail & Related papers (2022-02-01T08:57:31Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup to transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
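The interpolation rule the mixup summary refers to is easy to state. Below is a generic, input-space sketch of mixup; the mixup-transformer paper applies the interpolation to transformer representations for NLP tasks, which is not shown here. Parameter names are illustrative, and labels are assumed to be one-hot vectors.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Mixup: a convex combination of two examples and their labels,
    # with mixing weight lam drawn from Beta(alpha, alpha).
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

Because the label mix uses the same weight `lam` as the input mix, the mixed target remains a valid probability distribution whenever the inputs' labels are one-hot.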
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.