The Curse of Depth in Large Language Models
- URL: http://arxiv.org/abs/2502.05795v1
- Date: Sun, 09 Feb 2025 07:03:36 GMT
- Title: The Curse of Depth in Large Language Models
- Authors: Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
- Abstract summary: We introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected.
We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen.
Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN.
- Score: 28.37870372690079
- License:
- Abstract: In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, so that these blocks barely contribute to training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of the output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training.
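As a rough illustration of the proposed fix, the sketch below applies LayerNorm Scaling inside a simplified Pre-LN Transformer block: the output of each layer normalization in block ℓ (1-indexed) is multiplied by 1/√ℓ, damping the variance growth of deeper layers. The block layout, module names, and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of LayerNorm Scaling, per the abstract: scale each
# LayerNorm output in Transformer block `layer_index` (1-based) by
# 1/sqrt(layer_index). Attention masking is omitted for brevity.
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer_index)."""
    def __init__(self, hidden_size: int, layer_index: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.scale = 1.0 / math.sqrt(layer_index)  # layer_index is 1-based

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.scale

class PreLNBlock(nn.Module):
    """Simplified Pre-LN Transformer block using LayerNorm Scaling."""
    def __init__(self, hidden_size: int, num_heads: int, layer_index: int):
        super().__init__()
        self.ln_attn = ScaledLayerNorm(hidden_size, layer_index)
        self.ln_mlp = ScaledLayerNorm(hidden_size, layer_index)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln_attn(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual + attention
        x = x + self.mlp(self.ln_mlp(x))                   # residual + MLP
        return x
```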
Related papers
- Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective.
A key obstacle we identify is the cumulative effect of reconstruction errors throughout the sparsification process.
We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
arXiv Detail & Related papers (2024-10-14T03:35:11Z)
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and converges faster than standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Investigating Layer Importance in Large Language Models [28.156622049937216]
Large language models (LLMs) have gained increasing attention due to their prominent ability to understand and process texts.
The lack of understanding of LLMs has obstructed their deployment in safety-critical scenarios and hindered the development of better models.
This study identifies cornerstone layers in LLMs and underscores their critical role for future research.
arXiv Detail & Related papers (2024-09-22T09:53:13Z)
- DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs [46.443316184807145]
We introduce Dynamic Layer Operations (DLO), a novel approach for vertically scaling transformer-based Large Language Models (LLMs).
Unlike traditional Mixture-of-Experts (MoE) methods that focus on extending the model width, our approach targets model depth, addressing the redundancy observed across layer representations for various input samples.
Experimental results demonstrate that DLO not only outperforms the original unscaled models but also achieves comparable results to densely expanded models with significantly improved efficiency.
arXiv Detail & Related papers (2024-07-03T18:34:08Z)
- The Unreasonable Ineffectiveness of the Deeper Layers [5.984361440126354]
We study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs.
We find minimal degradation of performance until after a large fraction of the layers are removed.
From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
arXiv Detail & Related papers (2024-03-26T17:20:04Z)
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [38.148626520751385]
We show that many layers of Large Language Models (LLMs) exhibit high similarity, and some layers play a negligible role in network functionality.
We propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers.
Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning.
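A hedged sketch of the layer-removal idea described above: score each Transformer block by how little it changes its input hidden states on a small calibration batch, then delete the most redundant blocks. The redundancy score here (cosine similarity between a block's input and output states) and the deletion procedure are illustrative assumptions, not the paper's exact method.

```python
# Similarity-based layer removal, assuming blocks that map hidden states
# [batch, seq, dim] to hidden states (e.g. the simplified PreLNBlock above).
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_redundancy_scores(layers, hidden: torch.Tensor) -> list[float]:
    """Mean cosine similarity between each block's input and output states."""
    scores = []
    for layer in layers:
        out = layer(hidden)
        scores.append(F.cosine_similarity(hidden, out, dim=-1).mean().item())
        hidden = out
    return scores

def remove_most_redundant(layers: torch.nn.ModuleList, hidden: torch.Tensor, k: int):
    """Drop the k blocks whose outputs change the hidden states the least."""
    scores = layer_redundancy_scores(layers, hidden)
    drop = set(sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)[:k])
    return torch.nn.ModuleList(layer for i, layer in enumerate(layers) if i not in drop)
```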
arXiv Detail & Related papers (2024-03-06T17:04:18Z)
- Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity [88.62935593360162]
Large Language Models (LLMs) are renowned for their remarkable performance across diverse domains.
We introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed Outlier Weighed Layerwise sparsity (OWL).
OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively.
arXiv Detail & Related papers (2023-10-08T14:22:58Z)
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
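A rough sketch of the layer-contrasting idea, assuming a HuggingFace-style causal LM that exposes per-layer hidden states and whose output head can be applied to an earlier layer's states ("early exit"). The actual DoLa method also selects the premature layer dynamically and applies a plausibility constraint, which are omitted here; applying the model's final normalization before the early-exit head is likewise skipped for brevity.

```python
# Contrast next-token log-probabilities from the final (mature) layer against
# those from an earlier (premature) layer; tokens whose probability grows
# between the two layers are emphasized.
import torch
import torch.nn.functional as F

@torch.no_grad()
def contrastive_next_token_scores(model, input_ids: torch.Tensor, premature_layer: int):
    out = model(input_ids, output_hidden_states=True)
    final_h = out.hidden_states[-1][:, -1]                # mature (final) layer
    early_h = out.hidden_states[premature_layer][:, -1]   # premature layer
    mature = F.log_softmax(model.lm_head(final_h), dim=-1)
    premature = F.log_softmax(model.lm_head(early_h), dim=-1)
    return mature - premature  # contrastive score over the vocabulary
```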
arXiv Detail & Related papers (2023-09-07T17:45:31Z)