Do Language Models Use Their Depth Efficiently?
- URL: http://arxiv.org/abs/2505.13898v2
- Date: Fri, 30 May 2025 01:23:59 GMT
- Title: Do Language Models Use Their Depth Efficiently?
- Authors: Róbert Csordás, Christopher D. Manning, Christopher Potts
- Abstract summary: We analyze the residual stream of the Llama 3.1 and Qwen 3 family of models. We find that layers in the second half contribute much less than those in the first half. For multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults.
- Score: 53.56816097840505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1 and Qwen 3 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
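To make the first analysis above concrete, here is a minimal sketch of how one might measure each layer's relative contribution to the residual stream with Hugging Face Transformers. It is not the authors' code: it uses `gpt2` as an openly downloadable stand-in for the Llama 3.1 and Qwen 3 checkpoints studied in the paper, the prompt is arbitrary, and it measures whole-block updates rather than separate attention and MLP sublayer outputs, which is a coarser proxy for the comparison described in the abstract.

```python
# Hedged sketch: estimate how much each Transformer block changes the residual
# stream, as a rough proxy for the paper's sublayer-vs-residual comparison.
# "gpt2" is a stand-in model (assumption), not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The capital of the country north of the United States is"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[l] is the residual stream entering block l; the difference
# between consecutive entries is what block l wrote into the residual.
# (For GPT-2 the final entry already has the closing LayerNorm applied,
# so treat the last value with caution.)
hs = out.hidden_states  # tuple of (1, seq_len, d_model), length n_layers + 1
for layer, (h_in, h_out) in enumerate(zip(hs[:-1], hs[1:])):
    update = (h_out - h_in).norm(dim=-1)        # per-token update magnitude
    stream = h_in.norm(dim=-1).clamp_min(1e-6)  # per-token residual magnitude
    rel = (update / stream).mean().item()       # average relative contribution
    print(f"block {layer:2d}: relative update {rel:.3f}")
```

Under the paper's findings, one would expect these relative updates to be markedly smaller for blocks in the second half of the stack, with a phase transition near the midpoint.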
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- Scale Propagation Network for Generalizable Depth Completion [16.733495588009184]
We propose a novel scale propagation normalization (SP-Norm) method to propagate scales from input to output.
We also develop a new network architecture based on SP-Norm and the ConvNeXt V2 backbone.
Our model consistently achieves the best accuracy with faster speed and lower memory when compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-10-24T03:53:06Z)
- Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries [39.438904598467154]
We study how large language models (LLMs) solve complex multi-step problems.
Understanding how the latent step is computed internally is key to understanding the overall computation.
We propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer (see the sketch after this list).
arXiv Detail & Related papers (2024-06-18T16:44:13Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup of up to 8.6x on CPUs for sparse-quantized LLaMA models.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models [61.363259848264725]
Inheritune is a simple and effective training recipe for building smaller, more efficient language models. We show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts.
arXiv Detail & Related papers (2024-04-12T17:53:34Z)
- Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers.
arXiv Detail & Related papers (2023-07-13T09:39:10Z)
- Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, adversarial labels.
These results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the "NTK regime".
arXiv Detail & Related papers (2022-12-05T14:47:52Z)
- HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation [14.81943833870932]
We present an improved DepthNet, HR-Depth, with two effective strategies.
Using ResNet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution.
arXiv Detail & Related papers (2020-12-14T09:15:15Z)
- Parameter Efficient Deep Neural Networks with Bilinear Projections [16.628045837101237]
We address the parameter redundancy problem in deep neural networks (DNNs) by replacing conventional full projections with bilinear projections.
For a fully-connected layer with $D$ input nodes and $D$ output nodes, applying bilinear projection can reduce the model space complexity from quadratic to linear in $D$.
Experiments on four benchmark datasets show that applying the proposed bilinear projection to deep neural networks can achieve even higher accuracies.
arXiv Detail & Related papers (2020-11-03T00:17:24Z)
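Below is a minimal sketch of the "back-patching" idea referenced in the "Hopping Too Late" entry above: cache a hidden state from a later layer, then re-run the model while overwriting an earlier layer's output at the same position. The model choice (`gpt2`), the layer indices, and the prompt are illustrative assumptions, not the setup from that paper.

```python
# Hedged sketch of the general "back-patching" idea: run the model once, cache
# a hidden state from a later layer, then re-run while overwriting an earlier
# layer's output at the same position with the cached vector.
# "gpt2" and the layer/position choices are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The mother of the singer of 'Thriller' is", return_tensors="pt")
late_layer, early_layer, pos = 10, 4, -1  # hypothetical choices

with torch.no_grad():
    hs = model(**inputs, output_hidden_states=True).hidden_states
cached = hs[late_layer + 1][0, pos].clone()  # output of the later block

def patch_hook(module, args, output):
    # GPT-2 blocks return a tuple whose first element is the hidden state.
    hidden = output[0]
    hidden[0, pos] = cached  # overwrite the early layer's output at this position
    return (hidden,) + output[1:]

handle = model.transformer.h[early_layer].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**inputs).logits
handle.remove()

top_id = patched_logits[0, -1].argmax().item()
print("top prediction after back-patching:", tokenizer.decode(top_id))
```

Swapping in the models studied in these papers and measuring whether the correct answer's probability recovers after patching would follow the same hook pattern.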
This list is automatically generated from the titles and abstracts of the papers on this site.