Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
- URL: http://arxiv.org/abs/2412.13795v1
- Date: Wed, 18 Dec 2024 12:39:53 GMT
- Title: Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
- Authors: Pengxiang Li, Lu Yin, Shiwei Liu
- Abstract summary: Mix-LN is a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model.
Experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN.
- Score: 19.776151399951672
- License:
- Abstract: Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradients across layers. This allows all parts of the network--both shallow and deep layers--to contribute effectively to training. Extensive experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn better compared to those using Pre-LN or Post-LN during supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is available at https://github.com/pixeli99/MixLN.
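To make the layer-wise assignment concrete, below is a minimal PyTorch sketch of the idea described in the abstract: Post-LN in the earlier blocks, Pre-LN in the deeper ones. The block internals, hyperparameters, and the 25% Post-LN cutoff (`post_ln_frac`) are illustrative assumptions rather than the authors' implementation; see the linked repository for that.
```python
# Minimal sketch of the Mix-LN idea: Post-LN in the earlier Transformer layers,
# Pre-LN in the deeper ones. Hyperparameters and the 25% cutoff are assumptions.
import torch
import torch.nn as nn


class Block(nn.Module):
    """A Transformer block whose normalization placement is configurable."""

    def __init__(self, d_model: int, n_heads: int, post_ln: bool):
        super().__init__()
        self.post_ln = post_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        if self.post_ln:  # Post-LN: normalize after each residual addition
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.mlp(x))
        else:             # Pre-LN: normalize before each sub-layer
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.ln2(x))
        return x


class MixLNTransformer(nn.Module):
    def __init__(self, n_layers=12, d_model=256, n_heads=4, post_ln_frac=0.25):
        super().__init__()
        n_post = int(n_layers * post_ln_frac)  # earlier layers get Post-LN
        self.blocks = nn.ModuleList(
            [Block(d_model, n_heads, post_ln=(i < n_post)) for i in range(n_layers)]
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


if __name__ == "__main__":
    model = MixLNTransformer()
    tokens = torch.randn(2, 16, 256)  # (batch, sequence, d_model)
    print(model(tokens).shape)        # torch.Size([2, 16, 256])
```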
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- The Curse of Depth in Large Language Models [28.37870372690079]
We introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) that their deeper layers are less effective than expected.
We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen.
Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN.
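The summary does not say what LayerNorm Scaling does; to the best of my understanding (an assumption, not taken from the text above), it scales the output of each layer normalization inversely with the square root of the layer's depth, damping the output-variance growth that Pre-LN exhibits in deep layers. A minimal sketch:
```python
# Hedged sketch of LayerNorm Scaling: the output of the LayerNorm in layer l
# (1-indexed) is multiplied by 1/sqrt(l). The exact formulation may differ
# from the paper; treat this as illustrative only.
import math
import torch
import torch.nn as nn


class ScaledLayerNorm(nn.Module):
    def __init__(self, d_model: int, layer_index: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.scale = 1.0 / math.sqrt(layer_index)  # deeper layers get smaller outputs

    def forward(self, x):
        return self.scale * self.ln(x)


# Usage: replace the Pre-LN normalization of layer l with ScaledLayerNorm(d, l).
ln = ScaledLayerNorm(d_model=64, layer_index=8)
print(ln(torch.randn(2, 10, 64)).std())  # roughly 1/sqrt(8) of a plain LN output
```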
arXiv Detail & Related papers (2025-02-09T07:03:36Z)
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture [57.08322913112157]
Pre-LN and Post-LN have long dominated standard practices despite their limitations in large-scale training.
Several open-source large-scale models have recently begun silently adopting a third strategy, Peri-LN, without much explanation.
We show that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability.
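The placement itself is not spelled out above; the sketch below assumes Peri-LN normalizes both the input and the output of each sub-layer (LN placed peripherally around the module), which may differ in detail from the paper.
```python
# Rough sketch of a Peri-LN style block: LayerNorm is applied to the sub-layer
# input (as in Pre-LN) and again to the sub-layer output before it is added to
# the residual stream. Details are assumptions, not the paper's code.
import torch
import torch.nn as nn


class PeriLNBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln_in_attn, self.ln_out_attn = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ln_in_mlp, self.ln_out_mlp = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln_in_attn(x)
        x = x + self.ln_out_attn(self.attn(h, h, h, need_weights=False)[0])
        x = x + self.ln_out_mlp(self.mlp(self.ln_in_mlp(x)))
        return x


block = PeriLNBlock()
print(block(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```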
arXiv Detail & Related papers (2025-02-04T21:29:47Z)
- AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
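The summary leaves the allocation rule implicit; the sketch below illustrates the general idea under stated assumptions: estimate a heavy-tailedness exponent per layer from its weight spectrum and assign heavier-tailed (presumably better-trained) layers a lower sparsity. The crude Hill-style estimator and the linear mapping are mine, not AlphaPruning's exact recipe.
```python
# Hedged sketch: estimate a tail exponent "alpha" of each layer's squared
# singular values, then map higher-alpha (less heavy-tailed) layers to higher
# sparsity. The metric and mapping are assumptions for illustration only.
import torch


def crude_alpha(weight: torch.Tensor) -> float:
    """Hill-style estimate of the tail exponent of the squared singular values."""
    eigs = torch.linalg.svdvals(weight) ** 2
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(len(eigs) // 4, 2)                    # use the largest quarter as the tail
    tail = eigs[:k]
    return 1.0 + k / torch.log(tail / tail[-1]).sum().clamp(min=1e-6).item()


def allocate_sparsity(weights, target=0.7, spread=0.2):
    """Linearly map per-layer alpha to sparsity with mean close to `target`."""
    alphas = torch.tensor([crude_alpha(w) for w in weights])
    ranks = (alphas - alphas.mean()) / (alphas.std() + 1e-6)
    return (target + spread * ranks).clamp(0.0, 0.99)  # higher alpha -> prune more


layers = [torch.randn(256, 256) for _ in range(6)]
print(allocate_sparsity(layers))  # per-layer sparsity ratios averaging ~0.7
```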
arXiv Detail & Related papers (2024-10-14T03:35:11Z)
- On the Nonlinearity of Layer Normalization [5.0464797863553414]
We investigate the representation capacity of a network with layerwise composition of linear and LN transformations, referred to as LN-Net.
We show that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can correctly classify them.
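For concreteness, a small sketch of an LN-Net in this sense, i.e. an alternating stack of linear layers and LayerNorm; the depth and the final linear readout below are arbitrary choices for illustration, not the paper's construction.
```python
# Sketch of an LN-Net: a layerwise composition of linear maps and LayerNorm,
# the architecture whose representation capacity the paper analyzes. The
# 3-neuron width echoes the stated result; the depth here is arbitrary.
import torch
import torch.nn as nn


def ln_net(depth: int, width: int = 3, d_in: int = 3, n_classes: int = 2) -> nn.Sequential:
    layers = [nn.Linear(d_in, width)]
    for _ in range(depth):
        layers += [nn.LayerNorm(width), nn.Linear(width, width)]
    layers += [nn.Linear(width, n_classes)]
    return nn.Sequential(*layers)


net = ln_net(depth=16)      # O(m) LN layers for m samples, per the stated claim
x = torch.randn(32, 3)      # 32 samples, 3 features
print(net(x).shape)         # torch.Size([32, 2])
```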
arXiv Detail & Related papers (2024-06-03T12:11:34Z)
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
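A hedged sketch of the contrast step as I understand it: project a "mature" (final-layer) and a "premature" (earlier-layer) hidden state through the output head and score tokens by the difference of their log-probabilities. DoLa's dynamic premature-layer selection and plausibility constraint are omitted, and the fixed layer index is an assumption.
```python
# Hedged sketch of contrasting layers at decoding time. Not DoLa's full method:
# the premature layer is fixed here and the plausibility constraint is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrast_logits(hidden_states, lm_head, premature_layer: int = 8):
    """hidden_states: list of per-layer hidden states for the last position,
    each of shape (batch, d_model); lm_head: the output projection."""
    mature = F.log_softmax(lm_head(hidden_states[-1]), dim=-1)
    premature = F.log_softmax(lm_head(hidden_states[premature_layer]), dim=-1)
    return mature - premature  # higher where deep layers add probability mass


# Toy usage with random states standing in for a real model's activations.
d_model, vocab = 64, 100
lm_head = nn.Linear(d_model, vocab, bias=False)
states = [torch.randn(1, d_model) for _ in range(25)]
next_token = contrast_logits(states, lm_head).argmax(dim=-1)
print(next_token.shape)  # torch.Size([1])
```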
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
- Understanding the Role of Layer Normalization in Label-Skewed Federated Learning [15.19762600396105]
Layer normalization (LN) is a widely adopted deep learning technique especially in the era of foundation models.
In this work, we reveal the profound connection between layer normalization and the label shift problem in federated learning.
Our results verify that feature normalization (FN) is an essential ingredient inside LN for significantly improving the convergence of FL while remaining robust to learning rate choices.
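The sketch below assumes FN refers to the feature-normalization step inside LN, i.e. rescaling each feature vector to unit norm without mean-centering or learned affine parameters; that reading is an assumption based on the summary, not a quotation of the paper.
```python
# Sketch comparing full LayerNorm with the assumed feature-normalization (FN)
# step alone: FN here only rescales each feature vector to unit norm.
import torch
import torch.nn as nn


def feature_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x / (x.norm(dim=-1, keepdim=True) + eps)


x = torch.randn(4, 32)
ln = nn.LayerNorm(32, elementwise_affine=False)
print(ln(x).mean(dim=-1).abs().max())   # ~0: LN also removes the per-sample mean
print(feature_norm(x).norm(dim=-1))     # all ~1: FN only rescales
```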
arXiv Detail & Related papers (2023-08-18T13:57:04Z)
- ResiDual: Transformer with Dual Residual Connections [106.38073506751003]
Two widely used variants are Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN).
Post-LN causes a gradient-vanishing issue that hinders training deep Transformers, while Pre-LN causes a representation-collapse issue that limits model capacity.
We propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections of Post-LN and Pre-LN together.
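A rough sketch of the dual-stream idea under stated assumptions: one residual stream is normalized after each addition (Post-LN style) while a second stream accumulates the raw sub-layer outputs (Pre-LN style), and the two are merged after the last block. The merge rule and block details are mine, not necessarily ResiDual's exact architecture.
```python
# Rough sketch of a dual-residual block: a "post" stream normalized after each
# residual addition and a "dual" stream that accumulates raw sub-layer outputs,
# merged after the final block. Details are assumptions for illustration.
import torch
import torch.nn as nn


class DualResidualBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x_post, x_dual):
        a = self.attn(x_post, x_post, x_post, need_weights=False)[0]
        x_post, x_dual = self.ln1(x_post + a), x_dual + a
        m = self.mlp(x_post)
        return self.ln2(x_post + m), x_dual + m


blocks = nn.ModuleList([DualResidualBlock() for _ in range(4)])
final_ln = nn.LayerNorm(256)
x_post = x_dual = torch.randn(2, 16, 256)
for blk in blocks:
    x_post, x_dual = blk(x_post, x_dual)
out = x_post + final_ln(x_dual)   # merge the two streams at the top
print(out.shape)                  # torch.Size([2, 16, 256])
```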
arXiv Detail & Related papers (2023-04-28T12:19:47Z)
- B2T Connection: Serving Stability and Performance in Deep Transformers [40.44674210101826]
Recent Transformers tend to use Pre-LN because training deep Post-LN Transformers is often unstable, resulting in useless models.
However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers.
We propose a method that can provide both high stability and effective training by a simple modification of Post-LN.
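A hedged sketch of what such a modification could look like: a bottom-to-top shortcut that adds the block input back in just before the final normalization, bypassing the inner LayerNorms so gradients have a shorter path. The exact placement is an assumption, not necessarily the paper's B2T formulation.
```python
# Hedged sketch of a bottom-to-top (B2T) style connection on a Post-LN block:
# the block input is added back in right before the final LayerNorm.
import torch
import torch.nn as nn


class PostLNB2TBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        bottom = x                                              # saved for the B2T path
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.ln2(x + self.mlp(x) + bottom)                  # bottom-to-top shortcut
        return x


block = PostLNB2TBlock()
print(block(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```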
arXiv Detail & Related papers (2022-06-01T08:43:20Z)