Do Language Models Use Their Depth Efficiently?
- URL: http://arxiv.org/abs/2505.13898v2
- Date: Fri, 30 May 2025 01:23:59 GMT
- Title: Do Language Models Use Their Depth Efficiently?
- Authors: Róbert Csordás, Christopher D. Manning, Christopher Potts
- Abstract summary: We analyze the residual stream of the Llama 3.1 and Qwen 3 families of models. We find that layers in the second half contribute much less than those in the first half. For multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults.
- Score: 53.56816097840505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1 and Qwen 3 families of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
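As an illustration of the first finding, the sketch below measures how much each block changes the residual stream relative to the stream's own magnitude, using the per-block hidden states exposed by Hugging Face transformers. This is a coarser proxy than the paper's sub-layer (attention/MLP) analysis, and the model name is a placeholder for whichever causal LM you can load.

```python
# Minimal sketch: relative size of each block's update to the residual stream.
# Assumption: the model name below is a stand-in; any Hugging Face causal LM works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder; substitute a model you have access to
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

inputs = tok("The city that hosts the Louvre is the capital of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each (batch, seq_len, d_model):
# the embedding output followed by the residual stream after every block.
hs = out.hidden_states
for layer in range(len(hs) - 1):
    update = (hs[layer + 1] - hs[layer]).norm(dim=-1)   # size of this block's update, per token
    stream = hs[layer].norm(dim=-1).clamp_min(1e-6)     # size of the incoming residual stream
    print(f"block {layer:2d}: mean relative contribution {(update / stream).mean().item():.3f}")
```

Under the paper's first finding, this ratio should drop noticeably in the second half of the network; a single prompt is only an illustration, and a proper run averages over many tokens.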
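The fourth analysis fits linear maps from a shallow model's residual stream to a deeper model's. A minimal sketch follows, assuming both models share a tokenizer so token positions align (as Llama 3.1 and 3.2 do) and using a single placeholder prompt where the paper fits the maps over a large token sample.

```python
# Minimal sketch: fit a least-squares linear map from one layer of a small model's
# residual stream to every layer of a larger model and report the fit error.
# Assumptions: model names are placeholders, and both models must share a tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

small_name = "meta-llama/Llama-3.2-1B"   # placeholder shallow model
large_name = "meta-llama/Llama-3.1-8B"   # placeholder deep model

def residual_stream(name: str, text: str) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    # (num_layers + 1, seq_len, d_model), cast to float32 for the least-squares solver
    return torch.stack([h[0].float() for h in out.hidden_states])

text = "Both models read the same tokens so their positions can be paired."
hs_small = residual_stream(small_name, text)
hs_large = residual_stream(large_name, text)

src = len(hs_small) // 2                   # a mid-depth layer of the shallow model
X = hs_small[src]                          # (tokens, d_small)
for tgt in range(len(hs_large)):
    Y = hs_large[tgt]                      # (tokens, d_large)
    W = torch.linalg.lstsq(X, Y).solution  # linear map from d_small to d_large
    rel_err = ((X @ W - Y).pow(2).mean() / Y.pow(2).mean()).item()
    print(f"small layer {src} -> large layer {tgt}: relative error {rel_err:.3f}")
```

If the paper's conclusion holds, the best-fitting target layer should sit at roughly the same relative depth as the source layer.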
Related papers
- From Words to Amino Acids: Does the Curse of Depth Persist? [41.90462432884248]
We present a depth analysis of six popular protein language models (PLMs) across model families and scales. We observe consistent depth-dependent patterns that extend prior findings on large language models (LLMs). Our results suggest that PLMs exhibit a form of depth inefficiency, motivating future work on more depth-efficient architectures and training methods.
arXiv Detail & Related papers (2026-02-25T10:06:12Z) - From Growing to Looping: A Unified View of Iterative Computation in LLMs [42.46854018848624]
Looping and depth-grown models exhibit convergent depth-wise signatures. Applying inference-time looping to the middle blocks of a depth-grown model improves accuracy by up to $2\times$. Depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy mixtures.
arXiv Detail & Related papers (2026-02-18T14:25:16Z) - What Affects the Effective Depth of Large Language Models? [44.85395501835759]
We study how effective depth varies with model scale, training type, and task difficulty. We find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Our results suggest that current LLMs underuse available depth across scales, training paradigms, and tasks of varying difficulty.
arXiv Detail & Related papers (2025-12-16T04:07:17Z) - Compute-Optimal Scaling for Value-Based Deep RL [99.680827753493]
We investigate compute scaling for online, value-based deep RL. Our analysis reveals a nuanced interplay between model size, batch size, and the update-to-data (UTD) ratio. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD.
arXiv Detail & Related papers (2025-08-20T17:54:21Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Scale Propagation Network for Generalizable Depth Completion [16.733495588009184]
We propose a novel scale propagation normalization (SP-Norm) method to propagate scales from input to output.
We also develop a new network architecture based on SP-Norm and the ConvNeXt V2 backbone.
Our model consistently achieves the best accuracy with faster speed and lower memory when compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-10-24T03:53:06Z) - Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries [39.438904598467154]
We study how large language models (LLMs) solve complex multi-step problems.
Understanding how the latent step is computed internally is key to understanding the overall computation.
We propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer.
arXiv Detail & Related papers (2024-06-18T16:44:13Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models [61.363259848264725]
Inheritune is a simple and effective training recipe for building smaller, more efficient language models. We show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts.
arXiv Detail & Related papers (2024-04-12T17:53:34Z) - Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for combining the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers.
arXiv Detail & Related papers (2023-07-13T09:39:10Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with a quadratic loss function, a fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
Our results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation [14.81943833870932]
We present an improved DepthNet, HR-Depth, with two effective strategies.
Using ResNet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the fewest parameters at both high and low resolution.
arXiv Detail & Related papers (2020-12-14T09:15:15Z) - Parameter Efficient Deep Neural Networks with Bilinear Projections [16.628045837101237]
We address the parameter redundancy problem in deep neural networks (DNNs) by replacing conventional full projections with bilinear projections.
For a fully-connected layer with $D$ input nodes and $D$ output nodes, applying bilinear projection can reduce the model space complexity from quadratic ($O(D^2)$) to linear in $D$.
Experiments on four benchmark datasets show that applying the proposed bilinear projection to deep neural networks can achieve even higher accuracies than conventional full projections.
arXiv Detail & Related papers (2020-11-03T00:17:24Z)