Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
- URL: http://arxiv.org/abs/2511.04527v1
- Date: Thu, 06 Nov 2025 16:43:25 GMT
- Title: Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
- Authors: Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, Eric Bigelow,
- Abstract summary: We use hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. We find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. We also find that hidden activations can predict a model's future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
- Score: 21.8640687271413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model -- in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model's future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
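The abstract names two measurable quantities: token-level uncertainty and steerability under activation interventions. Below is a minimal sketch of both, assuming a small stand-in model (gpt2), an arbitrary layer index, and a random placeholder steering direction; the paper's models and steering vectors differ.

```python
# A minimal sketch, assuming gpt2 as a stand-in model, an arbitrary layer,
# and a random placeholder steering direction (not the paper's method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Let's think step by step. 17 * 24 ="
ids = tok(prompt, return_tensors="pt").input_ids

# Token-level uncertainty: entropy of the next-token distribution.
with torch.no_grad():
    logits = model(ids).logits[0, -1]
probs = torch.softmax(logits, dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
print(f"next-token entropy: {entropy.item():.3f} nats")

# Activation intervention: add a direction to the residual stream at one layer.
layer_idx, alpha = 6, 4.0                          # hypothetical choices
steer = torch.randn(model.config.hidden_size)      # placeholder direction
steer = steer / steer.norm()

def add_direction(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(add_direction)
with torch.no_grad():
    steered = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(steered[0]))
```

The hook leaves the base model untouched; the comparison the abstract describes is how much such a nudge changes the final answer at high-entropy versus low-entropy tokens.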
Related papers
- Reasoning aligns language models to human cognition [12.07126784684808]
We introduce an active probabilistic reasoning task that cleanly separates sampling (actively acquiring evidence) from inference (integrating evidence toward a decision). Benchmarking humans and a broad set of contemporary large language models against near-optimal reference policies reveals a consistent pattern. This model places humans and models in a shared low-dimensional cognitive space, reproduces behavioral signatures across agents, and shows how chain-of-thought shifts language models toward human-like regimes of evidence accumulation and belief-to-choice mapping.
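The sampling/inference split can be made concrete with a classical near-optimal reference policy. Below is a toy sequential probability ratio test (SPRT), my illustration rather than the paper's benchmark: "sampling" buys one more observation, "inference" is the log-likelihood-ratio update and stop rule.

```python
# Toy SPRT: an agent keeps buying evidence about a biased coin until its
# log-likelihood ratio crosses a decision boundary. My construction, used
# only to illustrate separating sampling from inference.
import math
import random

def sprt(true_p=0.7, p1=0.7, p0=0.3, alpha=0.05, beta=0.05, seed=0):
    rng = random.Random(seed)
    upper = math.log((1 - beta) / alpha)   # decide H1 above this
    lower = math.log(beta / (1 - alpha))   # decide H0 below this
    llr, n = 0.0, 0
    while lower < llr < upper:
        n += 1
        x = rng.random() < true_p          # sampling: acquire one observation
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))  # inference
    return ("H1" if llr >= upper else "H0"), n

decision, samples = sprt()
print(f"decision: {decision} after {samples} samples")
```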
arXiv Detail & Related papers (2026-02-09T14:13:39Z)
- Emergent Introspective Awareness in Large Language Models [2.2458442204933]
We investigate whether large language models can introspect on their internal states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Claude Opus 4 and 4.1, the most capable models, generally demonstrate the greatest introspective awareness.
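A common way to build an injectable concept is a mean activation difference between prompts that do and don't mention it. A sketch under that assumption follows; gpt2 and layer 6 are stand-ins (the paper works with Claude models).

```python
# Sketch of building a "concept vector" as a mean activation difference
# between concept-bearing and neutral prompts. gpt2 and LAYER are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # hypothetical injection site

def mean_hidden(texts):
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
        vecs.append(hs[0].mean(dim=0))     # average over token positions
    return torch.stack(vecs).mean(dim=0)

with_concept = ["The ocean waves crashed on the shore.", "Salt water and rolling tides."]
without = ["The desert sand stretched for miles.", "A quiet trail through the hills."]
concept = mean_hidden(with_concept) - mean_hidden(without)
print(concept.norm())  # scaled copies get added at LAYER before asking the model what it notices
```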
arXiv Detail & Related papers (2026-01-05T06:47:41Z)
- Temporal Predictors of Outcome in Reasoning Language Models [0.0]
The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning. We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact. Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens.
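The summary implies a probing setup: predict the final outcome from the hidden state at CoT position t, and watch how early the probe becomes accurate. A sketch on synthetic data (the signal growing with t is baked in for illustration; it is not a finding):

```python
# Logistic probe on the hidden state at CoT position t, predicting whether
# the final answer will be correct. Synthetic activations, not the paper's.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d, T = 400, 64, 20
outcome = rng.integers(0, 2, n)                 # 1 = final answer correct
H = rng.normal(size=(n, T, d))                  # fake hidden states
signal = rng.normal(size=d)
for t in range(T):                              # outcome signal grows with t
    H[:, t] += (t / T) * np.outer(2 * outcome - 1, signal)

for t in (1, 5, 10, 19):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          H[:, t], outcome, cv=5).mean()
    print(f"probe accuracy at token {t:2d}: {acc:.2f}")
```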
arXiv Detail & Related papers (2025-11-03T08:57:18Z)
- Emergence of Linear Truth Encodings in Language Models [64.86571541830598]
Large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end. We study one simple setting in which truth encoding can emerge, encouraging the model to learn this distinction in order to lower the LM loss on future tokens.
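If truth is linearly encoded, even the difference of class means yields a workable probe direction. A synthetic sketch of that idea (my construction, not the paper's toy transformer):

```python
# Mass-mean truth direction on synthetic activations: true and false
# statements are separated along a latent direction, and the difference of
# class means recovers it well enough to classify held-out examples.
import numpy as np

rng = np.random.default_rng(1)
d, n = 32, 200
mu = rng.normal(size=d)                        # latent truth direction
acts_true = rng.normal(size=(n, d)) + mu
acts_false = rng.normal(size=(n, d)) - mu

direction = acts_true.mean(axis=0) - acts_false.mean(axis=0)
direction /= np.linalg.norm(direction)

held_out = rng.normal(size=(50, d)) + mu       # fresh "true" activations
print("classified true:", (held_out @ direction > 0).mean())
```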
arXiv Detail & Related papers (2025-10-17T16:30:07Z)
- Pretrained LLMs Learn Multiple Types of Uncertainty [23.807232455808613]
Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. In this work, we study how well LLMs capture uncertainty, without explicitly being trained for that. We show that, when uncertainty is treated as a linear concept in the model's latent space, it might indeed be captured, even after only pretraining.
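Treating uncertainty as a linear concept suggests a simple test: regress next-token entropy on hidden states with a linear probe. A sketch with gpt2 and a toy prompt set as stand-ins (the fit below is in-sample, so a real experiment would evaluate on held-out prompts):

```python
# Linear probe from last-layer hidden states to next-token entropy.
# gpt2 and the three prompts are stand-ins; fit is in-sample only.
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

feats, targets = [], []
for text in ["The capital of France is", "Once upon a time there", "2 + 2 ="]:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    probs = torch.softmax(out.logits[0], dim=-1)
    targets.append(-(probs * probs.clamp_min(1e-12).log()).sum(-1))  # entropy per position
    feats.append(out.hidden_states[-1][0])                           # state per position

X = torch.cat(feats).numpy()
y = torch.cat(targets).numpy()
probe = Ridge(alpha=1.0).fit(X, y)
print(f"in-sample R^2: {probe.score(X, y):.2f}")
```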
arXiv Detail & Related papers (2025-05-27T14:06:15Z)
- A Psycholinguistic Evaluation of Language Models' Sensitivity to Argument Roles [0.06554326244334868]
We evaluate large language models' sensitivity to argument roles by replicating psycholinguistic studies on human argument role processing.
We find that language models are able to distinguish verbs that appear in plausible and implausible contexts, where plausibility is determined through the relation between the verb and its preceding arguments.
However, the pattern of this sensitivity differs from human real-time behavior, indicating that language models' capacity to detect verb plausibility does not arise from the same mechanism that underlies human real-time sentence processing.
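The standard contrast in this literature compares a verb's probability after plausible versus role-reversed argument orders. A sketch using gpt2 and one hand-written item (both are stand-ins for the paper's materials):

```python
# Role-reversal contrast: the same verb scored after plausible vs.
# role-reversed argument orders. gpt2 and the item are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def verb_logprob(context, verb):
    ids = tok(context, return_tensors="pt").input_ids
    verb_id = tok(" " + verb)["input_ids"][0]   # first token of the verb
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[verb_id].item()

plausible = verb_logprob("The customer that the waiter had", "served")
reversed_ = verb_logprob("The waiter that the customer had", "served")
print(f"plausible: {plausible:.2f}   role-reversed: {reversed_:.2f}")
```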
arXiv Detail & Related papers (2024-10-21T16:05:58Z)
- Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models [113.58052868898173]
We identify and characterize a previously undiscussed phenomenon, in which models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and automatically, curate a diverse test suite for diagnosing this behavior, and measure significant semantic leakage in 13 flagship models.
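A crude way to quantify leakage is to check how much an irrelevant concept in the prompt shifts the log-probability of a concept-colored continuation. A sketch, my construction rather than the paper's evaluation setting, with gpt2 as a stand-in:

```python
# Leakage probe: compare the log-probability of a concept-colored
# continuation with and without the irrelevant concept in the prompt.
# Assumes tokenization aligns cleanly at the prompt/continuation boundary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt, continuation):
    full = tok(prompt + continuation, return_tensors="pt").input_ids
    plen = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logp = torch.log_softmax(model(full).logits[0, :-1], dim=-1)
    idx = torch.arange(plen - 1, full.shape[1] - 1)
    return logp[idx, full[0, plen:]].sum().item()

cont = " school bus driver"
base = continuation_logprob("He works as a", cont)
leak = continuation_logprob("His favorite color is yellow. He works as a", cont)
print(f"log-prob shift from the irrelevant concept: {leak - base:+.2f}")
```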
arXiv Detail & Related papers (2024-08-12T22:30:55Z)
- Meanings and Feelings of Large Language Models: Observability of Latent States in Generative AI [65.04274914674771]
We show that current Large Language Models (LLMs) cannot have 'feelings', as the term is defined by the American Psychological Association (APA).
Our analysis sheds light on possible designs that would enable a model to perform non-trivial computation that is not visible to the user.
arXiv Detail & Related papers (2024-05-22T23:18:58Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
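This is the paper that introduced Contrast-Consistent Search (CCS): an unsupervised probe over (statement, negation) activation pairs, trained only so that the two probabilities are consistent (sum to 1) and confident. A sketch of the objective on synthetic activations:

```python
# CCS objective on synthetic activations; labels are generated only to
# score the result and are never shown to the method.
import torch

torch.manual_seed(0)
d, n = 32, 256
truth = torch.randn(d)                          # latent truth direction
labels = torch.randint(0, 2, (n,)).float()
sign = (2 * labels - 1).unsqueeze(1)
x_pos = torch.randn(n, d) + sign * truth        # "statement" activations
x_neg = torch.randn(n, d) - sign * truth        # "negation" activations

probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(300):
    p_pos = torch.sigmoid(probe(x_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(x_neg)).squeeze(-1)
    # consistency: p_pos should equal 1 - p_neg; confidence: avoid 0.5/0.5
    loss = ((p_pos - (1 - p_neg)) ** 2 + torch.minimum(p_pos, p_neg) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

pred = (torch.sigmoid(probe(x_pos)).squeeze(-1) > 0.5).float()
acc = max((pred == labels).float().mean(), (pred != labels).float().mean())
print(f"accuracy up to a sign flip: {acc.item():.2f}")
```

The max over the flipped labeling reflects a real property of the method: with no supervision, CCS recovers the truth direction only up to sign.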
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, making them harder for humans to interpret.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
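A contrastive explanation attributes the difference between a target logit and a foil logit, rather than the target logit alone. A minimal gradient-times-input sketch (gpt2 and the subject-verb agreement item are stand-ins; the paper evaluates several attribution methods):

```python
# Contrastive gradient-times-input: attribute logit(target) - logit(foil)
# to the input embeddings, token by token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The keys to the cabinet", return_tensors="pt").input_ids
target = tok(" are")["input_ids"][0]            # number-matching continuation
foil = tok(" is")["input_ids"][0]               # number-violating foil

embeds = model.transformer.wte(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits[0, -1]
(logits[target] - logits[foil]).backward()      # contrastive quantity

scores = (embeds.grad[0] * embeds[0]).sum(-1)   # gradient x input per token
for token, score in zip(tok.convert_ids_to_tokens(ids[0]), scores.tolist()):
    print(f"{token:>12s}  {score:+.3f}")
```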
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Large-scale language models that can generate long, coherent text are sometimes considered dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
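One simple instance of the hypothesis-testing framing is thresholding a text's average per-token log-likelihood under a reference model. A sketch, with gpt2 and the threshold value as placeholder choices:

```python
# Likelihood-threshold detector: model-generated text tends to sit at
# unusually high likelihood under the generating model's family, so scores
# above the boundary are flagged. THRESHOLD is a hypothetical value that
# would be tuned on held-out data in practice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logp = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    tokens = ids[0, 1:]
    return logp[torch.arange(tokens.shape[0]), tokens].mean().item()

THRESHOLD = -3.0
for text in ["The quick brown fox jumps over the lazy dog.",
             "Colorless green ideas sleep furiously."]:
    score = avg_logprob(text)
    flag = "flag as generated" if score > THRESHOLD else "treat as genuine"
    print(f"{score:+.2f}  {flag}:  {text}")
```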
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.