How Do LLMs Use Their Depth?
- URL: http://arxiv.org/abs/2510.18871v1
- Date: Tue, 21 Oct 2025 17:59:05 GMT
- Title: How Do LLMs Use Their Depth?
- Authors: Akshat Gupta, Jay Yeung, Gopala Anumanchipalli, Anna Ivanova
- Abstract summary: Large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. We propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions.
- Score: 17.148445769990907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model early on due to the lack of appropriate contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. Even high-frequency token predictions from early layers get refined >70% of the time, indicating that correct token prediction is not "one-and-done". We then go beyond frequency-based prediction to examine the dynamic usage of layer depth across three case studies. (i) Part-of-speech analysis shows that function words are, on average, the earliest to be predicted correctly. (ii) Fact recall task analysis shows that, in a multi-token answer, the first token requires more computational depth than the rest. (iii) Multiple-choice task analysis shows that the model identifies the format of the response within the first half of the layers, but finalizes its response only toward the end. Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future works to improve computational efficiency in transformer-based models.
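The tracing procedure the abstract describes — reading out a prediction from each intermediate layer and watching the top-ranked token change with depth — can be illustrated with a minimal logit-lens-style sketch. Random tensors stand in for a real model here; the layer count, dimensions, and the `layerwise_top1` helper are illustrative assumptions, not the paper's actual pipeline.

```python
# Toy sketch of layer-wise prediction tracing ("logit lens" style):
# decode each layer's hidden state through the unembedding matrix and
# count how often the top-1 prediction is refined between layers.
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, vocab_size = 12, 64, 100

# Hypothetical per-layer residual-stream states for one token position.
hidden_states = rng.normal(size=(n_layers, d_model))
W_U = rng.normal(size=(d_model, vocab_size))  # unembedding matrix

def layerwise_top1(states, unembed):
    """Return the top-ranked token id at each layer."""
    logits = states @ unembed      # (n_layers, vocab_size)
    return logits.argmax(axis=-1)  # top-1 token per layer

top1 = layerwise_top1(hidden_states, W_U)
# A prediction is "refined" whenever the top-1 token changes
# between consecutive layers.
refinements = int((top1[1:] != top1[:-1]).sum())
print(f"top-1 trajectory: {top1.tolist()}")
print(f"refinement events across depth: {refinements}")
```

In a real model the early-layer entries of this trajectory would be dominated by high-frequency tokens, with refinements into contextually appropriate tokens appearing deeper in the stack.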
Related papers
- Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation [34.21806963402883]
We study the untapped potential of context compression as a pretext task for unsupervised adaptation of large language models (LLMs). Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations. Further improvements through contrastive learning produce a strong representation model (LLM2Comp).
arXiv Detail & Related papers (2025-11-21T10:45:44Z)
- When can isotropy help adapt LLMs' next word prediction to numerical domains? [53.98633183204453]
It is shown that the isotropic property of LLM embeddings in contextual embedding space preserves the underlying structure of representations. Experiments show that different characteristics of numerical data and model architectures have different impacts on isotropy.
arXiv Detail & Related papers (2025-05-22T05:10:34Z)
- Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop [63.34626300024294]
TimeXL is a multi-modal prediction framework that integrates a prototype-based time series encoder. It produces more accurate predictions and interpretable explanations. Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9% improvement in AUC.
arXiv Detail & Related papers (2025-03-02T20:40:53Z)
- Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models [0.0]
This research aims to unravel how large language models (LLMs) iteratively refine token predictions through internal processing. We focused on how LLMs access and utilize information from input contexts, and how positioning of relevant information affects the model's token prediction refinement process.
arXiv Detail & Related papers (2025-01-25T03:34:15Z)
- Interpretable Next-token Prediction via the Generalized Induction Head [59.500195503897764]
Generalized Induction-Head Model (GIM) is an interpretable model for next-token prediction. In language modeling, GIM improves next-token prediction by up to 25 percentage points over interpretable baselines. In an fMRI setting, GIM improves neural response prediction by 20%.
arXiv Detail & Related papers (2024-10-31T12:33:26Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
- Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
- Enhancing Speech Recognition Decoding via Layer Aggregation [7.056222499095849]
We show that logits predicted using the top layers may hamper beam search from achieving optimal results.
We propose a prediction method that aggregates the top M layers, potentially leveraging useful information encoded in intermediate layers and relaxing model confidence.
arXiv Detail & Related papers (2022-03-21T20:28:06Z)
- Deep Learning Through the Lens of Example Difficulty [21.522182447513632]
We introduce a measure of the computational difficulty of making a prediction for a given input: the (effective) prediction depth.
Our investigation reveals surprising yet simple relationships between the prediction depth of a given input and the model's uncertainty, confidence, accuracy and speed of learning for that data point.
arXiv Detail & Related papers (2021-06-17T16:48:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.