Emergence of Linear Truth Encodings in Language Models
- URL: http://arxiv.org/abs/2510.15804v1
- Date: Fri, 17 Oct 2025 16:30:07 GMT
- Title: Emergence of Linear Truth Encodings in Language Models
- Authors: Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti
- Abstract summary: Large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end. We study one simple setting in which truth encoding can emerge: a data distribution in which factual statements co-occur with other factual statements, encouraging the model to learn this distinction in order to lower the LM loss on future tokens.
- Score: 64.86571541830598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then -- over a longer horizon -- learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
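The probing result at the heart of the abstract can be illustrated with a minimal sketch. This is a toy with all details assumed (synthetic vectors, not the paper's transformer model): if representations of true statements are shifted along a shared "truth direction," a linear probe fit by least squares separates true from false.

```python
import numpy as np

# Toy sketch (not the paper's model): representations of true statements
# are shifted along a shared "truth direction" relative to false ones.
rng = np.random.default_rng(0)
d, n = 32, 400

truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

labels = rng.integers(0, 2, size=n)        # 1 = true, 0 = false
signs = 2.0 * labels - 1.0                 # map labels to {-1, +1}
reps = rng.normal(size=(n, d)) + 2.0 * np.outer(signs, truth_dir)

# Fit a linear probe by least squares on the +/-1 targets.
w, *_ = np.linalg.lstsq(reps, signs, rcond=None)
acc = float(np.mean((reps @ w > 0) == (labels == 1)))
print(f"linear probe accuracy: {acc:.2f}")
```

In this idealized geometry the probe recovers the planted direction almost exactly; the interesting question the paper studies is why training dynamics produce such a direction in the first place.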
Related papers
- Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics [56.145578792496714]
Large language models (LLMs) struggle with cross-lingual knowledge transfer. We study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets.
arXiv Detail & Related papers (2025-08-14T18:44:13Z)
- A Markov Categorical Framework for Language Modeling [9.910562011343009]
Autoregressive language models achieve remarkable performance, yet a unified theory that explains their internal mechanisms, how training shapes their representations, and how those representations enable complex behaviors remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.
arXiv Detail & Related papers (2025-07-25T13:14:03Z)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs [3.6485741522018724]
Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. We extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families.
arXiv Detail & Related papers (2025-05-27T22:14:54Z)
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
- On the Origins of Linear Representations in Large Language Models [51.88404605700344]
We introduce a simple latent variable model to formalize concept dynamics in next-token prediction.
Experiments show that linear representations emerge when learning from data matching the latent variable model.
We additionally confirm some predictions of the theory using the LLaMA-2 large language model.
arXiv Detail & Related papers (2024-03-06T17:17:36Z)
- Personas as a Way to Model Truthfulness in Language Models [23.86655844340011]
Large language models (LLMs) are trained on vast amounts of text from the internet.
This paper presents an explanation for why LMs appear to know the truth despite not being trained with truth labels.
arXiv Detail & Related papers (2023-10-27T14:27:43Z)
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods.
Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations.
We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
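One simple probe family consistent with this summary is a difference-of-class-means ("mass-mean") direction: subtract the mean activation over false statements from the mean over true ones and classify by projection. The sketch below uses synthetic activations, with every detail assumed rather than taken from the paper.

```python
import numpy as np

# Sketch of a mass-mean style probe on synthetic "activations":
# the probe direction is the difference between the mean activation
# of true statements and the mean activation of false statements.
rng = np.random.default_rng(1)
d = 16
mu_true = rng.normal(size=d)
mu_false = mu_true + rng.normal(size=d)     # distinct class means

acts_true = mu_true + 0.3 * rng.normal(size=(200, d))
acts_false = mu_false + 0.3 * rng.normal(size=(200, d))

direction = acts_true.mean(axis=0) - acts_false.mean(axis=0)
# Threshold at the projection of the midpoint between class means.
midpoint = 0.5 * (acts_true.mean(axis=0) + acts_false.mean(axis=0))
threshold = midpoint @ direction

preds_true = acts_true @ direction > threshold
preds_false = acts_false @ direction > threshold
acc = float((preds_true.sum() + (~preds_false).sum()) / 400.0)
print(f"mass-mean probe accuracy: {acc:.2f}")
```

The appeal of this probe is that it needs no iterative training, only class means, which makes it a common baseline when testing whether truth is linearly represented.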
arXiv Detail & Related papers (2023-10-10T17:54:39Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
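The unsupervised objective behind this approach (Contrast-Consistent Search) can be written down directly. A minimal sketch of the loss, assuming a probe assigns probabilities p(x+) and p(x-) to a statement and its negation:

```python
def ccs_loss(p_pos: float, p_neg: float) -> float:
    """Contrast-consistent loss: a statement and its negation should get
    complementary probabilities (consistency), and the degenerate answer
    p = 0.5 everywhere is penalized (confidence)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence

# A confident, consistent pair of probabilities scores lower loss than
# the degenerate "always 0.5" answer the confidence term rules out.
assert ccs_loss(0.9, 0.1) < ccs_loss(0.5, 0.5)
```

Because the loss only compares a statement with its negation, minimizing it over a probe's parameters requires no truth labels, which is what makes the method purely unsupervised.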
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
- Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge [96.92252296244233]
Large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control.
We show that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements.
Our work paves a path towards open-domain systems that constantly improve by interacting with users who can instantly correct a model by adding simple natural language statements.
arXiv Detail & Related papers (2020-06-11T17:02:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.