Related papers: Representational Stability of Truth in Large Language Models

URL: http://arxiv.org/abs/2511.19166v1
Date: Mon, 24 Nov 2025 14:28:50 GMT
Title: Representational Stability of Truth in Large Language Models
Authors: Samantha Dies, Courtney Maynard, Germans Savcisens, Tina Eliassi-Rad
Abstract summary: We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We compare two types of neither statements: fact-like assertions about entities we believe to be absent from any training data, and nonfactual claims drawn from well-known fictional contexts. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains.
Score: 0.15655985886975654
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.
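
The two-step procedure in the abstract (fit a linear probe, then measure how its decision boundary moves under a label change) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' setup: the 2-D Gaussian clusters stand in for real LLM activations, the probe is a plain logistic regression fitted by gradient descent, and the perturbation relabels all neither statements from not-true to true. The names `make_cluster`, `train_probe`, `predict`, and `flip_rate` are hypothetical.

```python
import math
import random

random.seed(0)

def make_cluster(center, n, std=0.5):
    """Draw n synthetic 2-D 'activation' vectors around a center (toy data)."""
    return [[random.gauss(c, std) for c in center] for _ in range(n)]

# Hypothetical stand-ins for LLM activations of each statement type.
true_x = make_cluster((2.0, 0.0), 100)
false_x = make_cluster((-2.0, 0.0), 100)
neither_x = make_cluster((0.0, 2.0), 100)

def train_probe(X, y, epochs=300, lr=0.5):
    """Fit a logistic-regression probe (weights, bias) by batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(probe, x):
    """Return 1 if x falls on the 'true' side of the probe's boundary, else 0."""
    w, b = probe
    return int(b + sum(wi * xi for wi, xi in zip(w, x)) > 0)

def flip_rate(probe_a, probe_b, X):
    """Fraction of statements whose predicted label differs between two probes."""
    flips = sum(predict(probe_a, x) != predict(probe_b, x) for x in X)
    return flips / len(X)

X = true_x + false_x + neither_x
# Baseline labeling: true = 1, not-true (false and neither) = 0.
y_base = [1] * 100 + [0] * 100 + [0] * 100
# Perturbed labeling (assumed): neither statements are relabeled as true.
y_pert = [1] * 100 + [0] * 100 + [1] * 100

probe_base = train_probe(X, y_base)
probe_pert = train_probe(X, y_pert)

# Count flips only on statements whose own labels never changed.
rate = flip_rate(probe_base, probe_pert, true_x + false_x)
print(f"flipped truth judgements: {rate:.1%}")
```

In this sketch a low flip rate would mean the true/false boundary is largely unaffected by how the neither statements are labeled; the paper's actual activations, probe, and perturbation protocol may differ.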

Related papers

Emergence of Linear Truth Encodings in Language Models [64.86571541830598]
Large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end. We study one simple setting in which truth encoding can emerge, encouraging the model to learn this distinction in order to lower the LM loss on future tokens.
arXiv Detail & Related papers (2025-10-17T16:30:07Z)
LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance [19.466678464397216]
We show that internal representations of statement truthfulness collapse as the samples' presentations become less similar to those seen during pre-training. These findings offer a possible explanation for brittle benchmark performance.
arXiv Detail & Related papers (2025-10-13T20:13:56Z)
Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs [29.9148172868873]
Quantization enables efficient deployment of large language models (LLMs) in resource-constrained environments. We introduce TruthfulnessEval, a comprehensive evaluation framework for assessing the truthfulness of quantized LLMs. We find that while quantized models internally retain truthful representations, they are more susceptible to producing false outputs under misleading prompts.
arXiv Detail & Related papers (2025-08-26T21:01:45Z)
Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks [31.379237532476875]
We investigate whether large language models (LLMs) encode truthfulness as a distinct linear feature, termed the "truth direction". Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models. We show that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources.
arXiv Detail & Related papers (2025-06-01T03:55:53Z)
Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z)
Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs [48.202202256201815]
Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness.
arXiv Detail & Related papers (2025-05-22T11:00:53Z)
Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations [51.92795774118647]
We find that "verbal uncertainty" is governed by a single linear feature in the representation space of LLMs. We show that this has only moderate correlation with the actual "semantic uncertainty" of the model.
arXiv Detail & Related papers (2025-03-18T17:51:04Z)
MuLan: A Study of Fact Mutability in Language Models [50.626787909759976]
Trustworthy language models ideally identify mutable facts as such and process them accordingly. We create MuLan, a benchmark for evaluating the ability of English language models to anticipate time-contingency.
arXiv Detail & Related papers (2024-04-03T19:47:33Z)
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? [53.98071556805525]
Neural language models (LMs) can be used to evaluate the truth of factual statements. They can be queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
arXiv Detail & Related papers (2023-11-27T18:59:14Z)
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations. We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
arXiv Detail & Related papers (2023-10-10T17:54:39Z)
The Internal State of an LLM Knows When It's Lying [18.886091925252174]
Large Language Models (LLMs) have shown exceptional performance in various tasks. One of their most prominent drawbacks is generating inaccurate or false information with a confident tone. We provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements.
arXiv Detail & Related papers (2023-04-26T02:49:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.