What Affects the Effective Depth of Large Language Models?
- URL: http://arxiv.org/abs/2512.14064v1
- Date: Tue, 16 Dec 2025 04:07:17 GMT
- Title: What Affects the Effective Depth of Large Language Models?
- Authors: Yi Hu, Cai Zhou, Muhan Zhang
- Abstract summary: We study how effective depth varies with model scale, training type, and task difficulty. We find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Our results suggest that current LLMs underuse available depth across scales, training paradigms, and tasks of varying difficulty.
- Score: 44.85395501835759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of "effective depth", arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the behavior of the Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. In addition, comparisons between base models and their corresponding long-CoT counterparts show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms, and tasks of varying difficulty, pointing to research opportunities in increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released at https://github.com/AheadOFpotato/what_affects_effective_depth.
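As a rough illustration of the kind of per-layer analysis this line of work relies on, the sketch below probes how much each layer changes the residual stream of a small Qwen-2.5 model. It is not the paper's metric: the model name, the relative-update measure, and the 0.1 cutoff for an "effective" layer are all illustrative assumptions.

```python
# Minimal sketch (NOT the paper's exact metric): treat the size of each layer's
# update to the residual stream, relative to the stream itself, as a crude
# proxy for whether the layer does meaningful computation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"  # assumed checkpoint; any decoder-only LLM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, hidden]
    hs = model(**inputs, output_hidden_states=True).hidden_states

contrib = []
for i in range(1, len(hs)):
    delta = torch.linalg.norm(hs[i] - hs[i - 1], dim=-1)           # what layer i added
    stream = torch.linalg.norm(hs[i - 1], dim=-1).clamp(min=1e-6)  # residual stream so far
    contrib.append((delta / stream).mean().item())                 # mean relative update size

threshold = 0.1  # illustrative cutoff, not from the paper
effective = sum(c > threshold for c in contrib)
print(f"layers: {len(contrib)}, effective (relative update > {threshold}): {effective}")
print(f"effective depth ratio: {effective / len(contrib):.2f}")
```

Under this crude proxy, layers whose relative update stays below the cutoff would count as underused, mirroring the paper's observation that a stable fraction of the available depth goes unexploited as models scale.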
Related papers
- From Words to Amino Acids: Does the Curse of Depth Persist? [41.90462432884248]
We present a depth analysis of six popular protein language models (PLMs) across model families and scales. We observe consistent depth-dependent patterns that extend prior findings on large language models (LLMs). Our results suggest that PLMs exhibit a form of depth inefficiency, motivating future work on more depth-efficient architectures and training methods.
arXiv Detail & Related papers (2026-02-25T10:06:12Z)
- Inverse Depth Scaling From Most Layers Being Similar [20.276718813247786]
We quantify how depth affects loss via analysis of large language models (LLMs). We find that loss scales inversely with depth, probably because functionally similar layers reduce error through ensemble averaging; a crude layer-similarity probe is sketched after this list.
arXiv Detail & Related papers (2026-02-05T18:22:41Z)
- Compute-Optimal Scaling for Value-Based Deep RL [99.680827753493]
We investigate compute scaling for online, value-based deep RL. Our analysis reveals a nuanced interplay between model size, batch size, and the updates-to-data (UTD) ratio. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD.
arXiv Detail & Related papers (2025-08-20T17:54:21Z)
- TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z)
- Do Language Models Use Their Depth Efficiently? [61.0037917291838]
We analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 families of models. We find that layers in the second half contribute much less than those in the first half. For multi-hop tasks, we are unable to find evidence that models use increased depth to compose subresults.
arXiv Detail & Related papers (2025-05-20T04:00:56Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We first investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
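As referenced above, the following is a minimal sketch of a layer-similarity probe in the spirit of the "Inverse Depth Scaling" summary: it measures the cosine similarity between consecutive hidden states as a crude check of how interchangeable adjacent layers are. The model name and the prompt are illustrative assumptions, not details from that paper.

```python
# Minimal sketch (illustrative, not taken from any listed paper): high cosine
# similarity between consecutive hidden states suggests adjacent layers make
# small, functionally similar updates to the residual stream.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"  # assumed checkpoint; any decoder-only LLM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("Deep models may not use all of their layers.", return_tensors="pt")
with torch.no_grad():
    hs = model(**inputs, output_hidden_states=True).hidden_states

for i in range(1, len(hs)):
    # average cosine similarity between this layer's output and the previous one,
    # taken over all token positions
    sim = F.cosine_similarity(hs[i], hs[i - 1], dim=-1).mean().item()
    print(f"layer {i:2d}: cosine similarity to previous layer = {sim:.3f}")
```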