Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
- URL: http://arxiv.org/abs/2506.00823v1
- Date: Sun, 01 Jun 2025 03:55:53 GMT
- Title: Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
- Authors: Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, Jianwei Yin
- Abstract summary: We investigate whether large language models (LLMs) encode truthfulness as a distinct linear feature, termed the "truth direction". Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models. We show that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources.
- Score: 31.379237532476875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at https://github.com/colored-dye/truthfulness_probe_generalization
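To make the abstract's recipe concrete, the sketch below trains a simple logistic-regression truthfulness probe on last-token hidden states of declarative statements and then uses the probe's confidence for selective answering. It is a minimal illustration in the spirit of the paper, not the authors' pipeline (see the linked repository for that); the model name, layer index, toy statements, and abstention threshold are all illustrative assumptions.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # hypothetical checkpoint; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def last_token_state(text: str, layer: int = 16) -> torch.Tensor:
    """Hidden state of the statement's final token at a chosen intermediate layer."""
    ids = tok(text, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[layer][0, -1].float()

# Toy labelled declarative atomic statements (1 = true, 0 = false).
statements = [
    ("Paris is the capital of France.", 1),
    ("The Pacific is the largest ocean on Earth.", 1),
    ("Madrid is the capital of Germany.", 0),
    ("The Sun orbits the Earth.", 0),
]
X = torch.stack([last_token_state(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

# The learned weight vector probe.coef_ plays the role of a linear "truth direction".
probe = LogisticRegression(max_iter=1000).fit(X, y)

def answer_or_abstain(statement: str, threshold: float = 0.75) -> str:
    """Selective answering: return the statement only if the probe deems it likely true."""
    p_true = probe.predict_proba(last_token_state(statement).numpy()[None, :])[0, 1]
    return statement if p_true >= threshold else "[abstain: low truthfulness score]"

print(answer_or_abstain("Berlin is the capital of Germany."))
```

In practice one would train on many labelled statements and calibrate the abstention threshold on held-out data; the point here is only that the probe's weight vector acts as the linear "truth direction" and its predicted probability doubles as a confidence score for selective question answering.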
Related papers
- Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs [48.202202256201815]
Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness.
arXiv Detail & Related papers (2025-05-22T11:00:53Z) - Exploring the generalization of LLM truth directions on conversational formats [0.0]
We show that linear probes trained on a single hidden state of the model already generalize across a range of topics. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt.
arXiv Detail & Related papers (2025-05-14T21:21:08Z) - Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning [79.48839334040197]
Instruction fine-tuning (IFT) can increase the informativeness of large language models (LLMs), but may reduce their truthfulness. In this paper, we empirically demonstrate how unfamiliar knowledge in IFT datasets can negatively affect the truthfulness of LLMs. We introduce two new IFT paradigms, $UNIT_{cut}$ and $UNIT_{ref}$, to address this issue.
arXiv Detail & Related papers (2025-02-17T16:10:30Z) - Scaling Laws for Fact Memorization of Large Language Models [67.94080978627363]
We analyze the scaling laws for Large Language Models' fact knowledge and their behaviors of memorizing different types of facts.
We find that LLMs' fact-knowledge capacity scales linearly with model size and follows a negative-exponential law with training epochs.
Our findings reveal the capacity and characteristics of LLMs' fact knowledge learning, which provide directions for LLMs' fact knowledge augmentation.
arXiv Detail & Related papers (2024-06-22T03:32:09Z) - Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression [19.69104070561701]
Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts.
We propose LITO, a Learnable Intervention method for Truthfulness Optimization.
Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy.
arXiv Detail & Related papers (2024-05-01T03:50:09Z) - Truth-Aware Context Selection: Mitigating Hallucinations of Large Language Models Being Misled by Untruthful Contexts [31.769428095250912]
Large Language Models (LLMs) are easily misled by untruthful contexts provided by users or knowledge augmentation tools.
We propose Truth-Aware Context Selection (TACS) to adaptively recognize and mask untruthful context from the inputs.
We show that TACS can effectively filter untruthful context and significantly improve the overall quality of LLMs' responses when presented with misleading information.
arXiv Detail & Related papers (2024-03-12T11:40:44Z) - TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space [31.769428095250912]
Large Language Models (LLMs) sometimes produce untruthful responses despite knowing the correct knowledge.
We propose TruthX, an inference-time intervention method to activate the truthfulness of LLMs.
arXiv Detail & Related papers (2024-02-27T14:45:04Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods.
Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations.
We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
arXiv Detail & Related papers (2023-10-10T17:54:39Z) - Do Large Language Models Know about Facts? [60.501902866946]
Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks.
We aim to evaluate the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio.
Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages.
arXiv Detail & Related papers (2023-10-08T14:26:55Z) - DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
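Because the DoLa entry above describes a concrete decoding rule, here is a minimal, hedged sketch of the layer-contrasting idea: score the next token by the gap between final-layer and early-layer log-probabilities, restricted to tokens the final layer already finds plausible. This is not the authors' implementation (which, among other refinements, selects the premature layer dynamically); the checkpoint name, the fixed early layer, and the plausibility cutoff `alpha` are assumptions, as are the LLaMA-style attribute names `model.model.norm` and `model.lm_head` on the chosen checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # hypothetical LLaMA-style checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def contrastive_next_token(prompt: str, early_layer: int = 16, alpha: float = 0.1) -> str:
    """Pick the next token by contrasting final-layer and early-layer predictions."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    final_logits = out.logits[0, -1]
    # Early-exit logits: apply the model's final norm and LM head to an intermediate
    # hidden state (assumes LLaMA-style modules `model.model.norm` and `model.lm_head`).
    early_hidden = out.hidden_states[early_layer][0, -1]
    early_logits = model.lm_head(model.model.norm(early_hidden))
    log_p_final = torch.log_softmax(final_logits.float(), dim=-1)
    log_p_early = torch.log_softmax(early_logits.float(), dim=-1)
    # Adaptive plausibility set: only contrast tokens the final layer already finds likely.
    keep = log_p_final >= log_p_final.max() + torch.log(torch.tensor(alpha))
    scores = torch.where(keep, log_p_final - log_p_early, torch.tensor(float("-inf")))
    return tok.decode([scores.argmax().item()])

print(contrastive_next_token("The capital of France is"))
```

Greedy selection over the contrastive scores is enough to show the mechanism; a full decoder would repeat this step token by token.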