Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
- URL: http://arxiv.org/abs/2512.08819v1
- Date: Tue, 09 Dec 2025 17:12:04 GMT
- Title: Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
- Authors: Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer
- Abstract summary: We show that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half. This work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits.
- Score: 40.72065859626204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025, Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
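The growth mechanism referenced above, gradual middle stacking as in MIDAS, can be pictured as periodically duplicating the middle block of layers during training so that the model deepens in stages. The sketch below is an illustrative reading of that idea, not the MIDAS implementation: the block size, growth schedule, and copy-based initialisation are assumptions.

```python
import copy
import torch.nn as nn

def grow_middle(layers: nn.ModuleList, num_new: int) -> nn.ModuleList:
    """Illustrative middle-stacking step (not the official MIDAS code):
    clone the middle `num_new` layers, initialise the clones from those
    layers' weights, and insert them at the centre of the stack."""
    depth = len(layers)
    start = (depth - num_new) // 2                       # left edge of the middle block
    middle = [layers[i] for i in range(start, start + num_new)]
    clones = [copy.deepcopy(layer) for layer in middle]  # new layers start as copies
    grown = list(layers[: start + num_new]) + clones + list(layers[start + num_new:])
    return nn.ModuleList(grown)

# Toy usage: one growth stage taking an 8-layer stack to 12 layers.
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(8)]
)
layers = grow_middle(layers, num_new=4)
assert len(layers) == 12
```

In a full training run such a step would be applied at a few predefined milestones, with optimizer state either re-initialised or carried over for the copied layers; those details are left out here.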
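A simple depth-wise diagnostic in the spirit of the analyses mentioned above is the relative size of each layer's residual update: if second-half layers barely move the residual stream, their ratios collapse toward zero, which is one face of the Curse of Depth. The hedged sketch below assumes hidden states have already been captured after the embedding and after every layer of a pre-layernorm model; the capture mechanism and averaging choices are not taken from the paper.

```python
import torch

def residual_update_ratios(hidden_states: list[torch.Tensor]) -> list[float]:
    """Given hidden states h_0, ..., h_L recorded after the embedding and after
    each layer, return ||h_l - h_{l-1}|| / ||h_{l-1}|| per layer, averaged over
    tokens: a rough proxy for how strongly each layer rewrites the residual stream."""
    ratios = []
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        update = (curr - prev).norm(dim=-1)           # per-token update magnitude
        base = prev.norm(dim=-1).clamp_min(1e-6)      # per-token stream magnitude
        ratios.append((update / base).mean().item())
    return ratios

# Toy usage with random tensors standing in for a real forward pass
# (batch of 2, sequence length 16, hidden size 64, 12 "layers" plus embedding).
states = [torch.randn(2, 16, 64) for _ in range(13)]
print(residual_update_ratios(states))
```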
Related papers
- From Growing to Looping: A Unified View of Iterative Computation in LLMs [42.46854018848624]
Looping and depth-grown models exhibit convergent depth-wise signatures. Applying inference-time looping to the middle blocks of a depth-grown model improves accuracy by up to $2\times$. Depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy mixtures. (A hedged sketch of middle-block looping appears after this list.)
arXiv Detail & Related papers (2026-02-18T14:25:16Z)
- Generative Model Inversion Through the Lens of the Manifold Hypothesis [98.37040155914595]
Model inversion attacks (MIAs) aim to reconstruct class-representative samples from trained models. Recent generative MIAs utilize generative adversarial networks to learn image priors that guide the inversion process.
arXiv Detail & Related papers (2025-09-24T14:39:25Z)
- Region-aware Depth Scale Adaptation with Sparse Measurements [8.532410904912922]
We introduce a non-learning-based approach to adapt the relative-scale predictions of foundation models into metric-scale depth. Our method requires neither retraining nor fine-tuning, thereby preserving the strong generalization ability of the original foundation models.
arXiv Detail & Related papers (2025-07-20T09:36:57Z)
- Do Language Models Use Their Depth Efficiently? [61.0037917291838]
We analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 families of models. We find that layers in the second half contribute much less than those in the first half. For multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults.
arXiv Detail & Related papers (2025-05-20T04:00:56Z)
- The Curse of Depth in Large Language Models [28.37870372690079]
In large language models, nearly half of the layers are less effective than expected. LayerNorm Scaling (LNS) scales the variance of the output of layer normalization inversely by the square root of its depth. LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. (A hedged LNS sketch appears after this list.)
arXiv Detail & Related papers (2025-02-09T07:03:36Z)
- DepthART: Monocular Depth Estimation as Autoregressive Refinement Task [2.3884184860468136]
We introduce DepthART, a novel training method formulated as a Depth Autoregressive Refinement Task. By utilizing the model's own predictions as inputs, we frame the objective as residual minimization, effectively reducing the discrepancy between training and inference procedures. When trained on the Hypersim dataset using our approach, the model achieves superior results across multiple unseen benchmarks compared to existing generative and discriminative baselines.
arXiv Detail & Related papers (2024-09-23T13:36:34Z)
- LaCo: Large Language Model Pruning via Layer Collapse [56.92068213969036]
Large language models (LLMs) based on the Transformer architecture are witnessing a notable trend of size expansion.
Existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues.
We propose a concise layer-wise structured pruner called Layer Collapse (LaCo), in which rear model layers collapse into a prior layer.
arXiv Detail & Related papers (2024-02-17T04:16:30Z)
- Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN [9.185929396989083]
We employ a sparse pixel approach to contrastively analyze the distinctions between Transformers and CNNs.
Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity.
We propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration.
arXiv Detail & Related papers (2023-08-16T12:46:52Z)
- Unlocking the Potential of Federated Learning for Deeper Models [24.875271131226707]
Federated learning (FL) is a new paradigm for distributed machine learning that allows a global model to be trained across multiple clients.
We propose several technical guidelines based on reducing divergence, such as using wider models and reducing the receptive field.
These approaches can greatly improve the accuracy of FL on deeper models.
arXiv Detail & Related papers (2023-06-05T08:45:44Z)
- Extended Unconstrained Features Model for Exploring Deep Neural Collapse [59.59039125375527]
Recently, a phenomenon termed "neural collapse" (NC) has been empirically observed in deep neural networks.
Recent papers have shown that minimizers with this structure emerge when optimizing a simplified "unconstrained features model" (UFM).
In this paper, we study the UFM for the regularized MSE loss, and show that the minimizers' features can be more structured than in the cross-entropy case.
arXiv Detail & Related papers (2022-02-16T14:17:37Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models. (A hedged sketch of the reparameterisation appears after this list.)
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
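For the result referenced in "From Growing to Looping" above, inference-time looping over middle blocks can be pictured as re-applying a contiguous slice of layers several times during the forward pass. The sketch below is illustrative only; the block boundaries, loop count, and module types are assumptions, not that paper's implementation.

```python
import torch
import torch.nn as nn

def forward_with_middle_looping(layers, x, loop_start, loop_end, num_loops=2):
    """Re-apply layers[loop_start:loop_end] `num_loops` times at inference;
    num_loops=1 reproduces the ordinary forward pass."""
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(num_loops):
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x

# Toy usage: loop the middle third of a 12-layer stack twice.
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(12)]
)
y = forward_with_middle_looping(layers, torch.randn(2, 16, 64), loop_start=4, loop_end=8)
print(y.shape)  # torch.Size([2, 16, 64])
```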
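The LayerNorm Scaling summarised for "The Curse of Depth in Large Language Models" above amounts to damping the output of layer normalization by the inverse square root of the layer's depth. A minimal sketch under that reading follows; the indexing convention and module placement are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is multiplied by 1/sqrt(layer_index), so deeper
    layers inject smaller updates into the residual stream. Whether indexing
    starts at 1 and how the embedding layer is counted are assumptions."""
    def __init__(self, hidden_size: int, layer_index: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.scale = 1.0 / (layer_index ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.scale

# Toy usage: at layer 16 the normalized output is damped by a factor of 4.
ln = ScaledLayerNorm(hidden_size=64, layer_index=16)
print(ln(torch.randn(2, 8, 64)).std())
```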
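Finally, the Powerpropagation reparameterisation noted above is commonly written as replacing each weight by the underlying parameter raised elementwise to a power while keeping its sign, which biases training toward exact zeros. The sketch below assumes the form w = theta * |theta|^(alpha - 1) with a fixed alpha; it is not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    """Linear layer with a Powerprop-style reparameterisation:
    effective weight = theta * |theta|**(alpha - 1); alpha = 1 recovers a plain
    linear layer. Initialisation and the choice of alpha are assumptions."""
    def __init__(self, in_features: int, out_features: int, alpha: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.theta * self.theta.abs().pow(self.alpha - 1.0)
        return F.linear(x, weight, self.bias)

# Toy usage: gradients flow through the reparameterisation, so small weights
# are pushed toward zero more strongly than in a standard linear layer.
layer = PowerpropLinear(32, 16)
layer(torch.randn(4, 32)).sum().backward()
print(layer.theta.grad.shape)  # torch.Size([16, 32])
```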