Exploring the generalization of LLM truth directions on conversational formats
- URL: http://arxiv.org/abs/2505.09807v1
- Date: Wed, 14 May 2025 21:21:08 GMT
- Title: Exploring the generalization of LLM truth directions on conversational formats
- Authors: Timour Ichmoukhamedov, David Martens
- Abstract summary: We show that linear probes trained on a single hidden state of the model already generalize across a range of topics. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several recent works argue that LLMs have a universal truth direction where true and false statements are linearly separable in the activation space of the model. It has been demonstrated that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used for lie detection in LLM conversations. In this work we explore how this truth direction generalizes between various conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt. We propose a solution that significantly improves this type of generalization by adding a fixed key phrase at the end of each conversation. Our results highlight the challenges involved in building reliable LLM lie detectors that generalize to new settings.
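The probing setup described above lends itself to a short illustration. The following is a minimal sketch, not the authors' code: the model name, probed layer, and key phrase are assumptions chosen for illustration. It extracts the hidden state at the final input token of each conversation (optionally after appending a fixed key phrase, as the abstract proposes) and fits a linear probe on truthful-vs-lie labels.

```python
# Minimal sketch of a truth-direction probe; model, layer index, and key phrase
# are illustrative assumptions, not the paper's exact choices.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"  # assumption: any causal LM exposes hidden states this way
LAYER = 16                           # assumption: a middle layer is probed
KEY_PHRASE = " Please assess whether the statements above are truthful."  # illustrative phrase

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def last_token_state(conversation: str, append_key_phrase: bool = True) -> torch.Tensor:
    """Hidden state of the final input token at the chosen layer."""
    if append_key_phrase:
        conversation = conversation + KEY_PHRASE
    out = model(**tok(conversation, return_tensors="pt"))
    return out.hidden_states[LAYER][0, -1]  # shape: [hidden_dim]

def fit_probe(conversations, labels):
    """conversations: list of strings; labels: 1 = truthful, 0 = contains a lie."""
    X = torch.stack([last_token_state(c) for c in conversations]).float().numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)
```

Appending the same key phrase to every conversation plausibly places the probed token in a consistent local context regardless of where the lie occurs, which is one way to read the improvement reported in the abstract.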
Related papers
- Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks [31.379237532476875]
We investigate whether large language models (LLMs) encode truthfulness as a distinct linear feature, termed the "truth direction". Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models. We show that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources.
arXiv Detail & Related papers (2025-06-01T03:55:53Z) - Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs [48.202202256201815]
Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness.
arXiv Detail & Related papers (2025-05-22T11:00:53Z) - LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations [46.351064535592336]
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures.
Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs.
We show that the internal representations of LLMs encode much more information about truthfulness than previously recognized.
arXiv Detail & Related papers (2024-10-03T17:31:31Z) - Truth is Universal: Robust Detection of Lies in LLMs [18.13311575803723]
Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities.
In this work, we aim to develop a robust method to detect when an LLM is lying.
We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated.
This finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B and LLaMA3-8B.
Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection in LLMs.
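A rough sketch of how such a two-dimensional subspace can be constructed (the grouping convention and combination rule are illustrative assumptions, not the authors' exact procedure): separate truth directions are estimated for affirmative and negated statements and combined into a shared plane.

```python
# Sketch: spanning a 2D "truth plane" from activations grouped by polarity and label;
# the grouping and combination rule are illustrative assumptions.
import numpy as np

def truth_plane(aff_true, aff_false, neg_true, neg_false):
    """Each argument: activation matrix of shape [n_statements, hidden_dim]."""
    d_aff = aff_true.mean(0) - aff_false.mean(0)   # truth direction on affirmative statements
    d_neg = neg_true.mean(0) - neg_false.mean(0)   # truth direction on negated statements
    t_general = 0.5 * (d_aff + d_neg)              # direction shared by both polarities
    t_polarity = 0.5 * (d_aff - d_neg)             # polarity-sensitive direction
    basis, _ = np.linalg.qr(np.stack([t_general, t_polarity], axis=1))
    return basis                                   # orthonormal basis, shape [hidden_dim, 2]

def project(activations, basis):
    """Project activations onto the plane; a linear classifier can then separate true/false."""
    return activations @ basis
```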
arXiv Detail & Related papers (2024-07-03T13:01:54Z) - Scaling Laws for Fact Memorization of Large Language Models [67.94080978627363]
We analyze the scaling laws for Large Language Models' fact knowledge and their behaviors of memorizing different types of facts.
We find that LLMs' fact knowledge capacity has a linear and negative exponential law relationship with model size and training epochs, respectively.
Our findings reveal the capacity and characteristics of LLMs' fact knowledge learning, which provide directions for LLMs' fact knowledge augmentation.
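As a hedged illustration of the summary's "linear and negative exponential law", one could fit a functional form like the following to measured memorization counts; the exact formula, coefficients, and data below are assumptions, not taken from the paper.

```python
# Sketch: fitting an assumed 'linear in model size, negative exponential in epochs'
# law to hypothetical fact-memorization measurements.
import numpy as np
from scipy.optimize import curve_fit

def fact_capacity(X, a, b, c):
    """Assumed form: capacity ≈ a * size + b * (1 - exp(-c * epochs))."""
    size, epochs = X
    return a * size + b * (1.0 - np.exp(-c * epochs))

# hypothetical measurements: model size (billions of parameters), epochs, facts memorized
sizes = np.array([0.5, 1.0, 3.0, 7.0, 13.0])
epochs = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
memorized = np.array([0.9, 2.0, 5.8, 13.5, 25.4])

params, _ = curve_fit(fact_capacity, (sizes, epochs), memorized, p0=[1.0, 1.0, 0.1])
print(dict(zip("abc", params)))
```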
arXiv Detail & Related papers (2024-06-22T03:32:09Z) - Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression [19.69104070561701]
Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts.
We propose LITO, a Learnable Intervention method for Truthfulness Optimization.
Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy.
arXiv Detail & Related papers (2024-05-01T03:50:09Z) - The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods.
Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations.
We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
arXiv Detail & Related papers (2023-10-10T17:54:39Z) - Do Large Language Models Know about Facts? [60.501902866946]
Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks.
We aim to evaluate the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio.
Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages.
arXiv Detail & Related papers (2023-10-08T14:26:55Z) - FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
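A minimal sketch of the kind of search-augmented few-shot prompt this describes (the template wording, snippet fields, and formatting are assumptions, not the paper's exact FreshPrompt format):

```python
# Sketch: assembling a search-augmented few-shot prompt; the template and
# snippet fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Snippet:
    source: str
    date: str
    text: str

def build_prompt(question: str, snippets: list[Snippet], demonstrations: list[str]) -> str:
    evidence = "\n".join(
        f"[{i + 1}] ({s.date}, {s.source}) {s.text}" for i, s in enumerate(snippets)
    )
    demos = "\n\n".join(demonstrations)  # few-shot examples of question -> reasoning -> answer
    return (
        f"{demos}\n\n"
        "Search results (most recent first):\n"
        f"{evidence}\n\n"
        f"Question: {question}\n"
        "Answer, relying on the most up-to-date evidence:"
    )
```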
arXiv Detail & Related papers (2023-10-05T00:04:12Z) - How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions [34.53980255211931]
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense.
Here, we develop a simple lie detector that requires neither access to the LLM's activations nor ground-truth knowledge of the fact in question.
Despite its simplicity, this lie detector is highly accurate and surprisingly general.
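A rough sketch of the black-box idea summarized above (the elicitation questions, the ask() interface, and the classifier choice are illustrative assumptions): after the suspected lie, a fixed set of unrelated yes/no questions is asked, and the pattern of answers is classified.

```python
# Sketch: black-box lie detection from answers to fixed, unrelated follow-up questions;
# the question list and ask() interface are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

ELICITATION_QUESTIONS = [  # hypothetical examples of unrelated follow-up probes
    "Is the sky blue on a clear day? Answer yes or no.",
    "Does 2 + 2 equal 5? Answer yes or no.",
    "Is Paris the capital of France? Answer yes or no.",
]

def answer_features(conversation: str, ask) -> np.ndarray:
    """ask(prompt) -> the model's text reply; each feature is 1 if the reply starts with 'yes'."""
    replies = [ask(conversation + "\n" + q) for q in ELICITATION_QUESTIONS]
    return np.array([1.0 if r.strip().lower().startswith("yes") else 0.0 for r in replies])

def train_detector(conversations, labels, ask):
    """labels: 1 if the conversation contains a lie, 0 otherwise."""
    X = np.stack([answer_features(c, ask) for c in conversations])
    return LogisticRegression().fit(X, np.asarray(labels))
```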
arXiv Detail & Related papers (2023-09-26T16:07:54Z) - DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
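A simplified sketch of layer-contrastive decoding in this spirit (fixed layer indices, no dynamic layer selection or candidate filtering; a simplification rather than the paper's full method):

```python
# Sketch: contrasting next-token distributions from a late vs. an early layer;
# fixed layer indices are a simplification of the published method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"        # assumption: any causal LM whose LM head can be applied to hidden states
EARLY, LATE = 4, -1   # illustrative "premature" and "mature" layers

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def contrastive_next_token_logits(text: str) -> torch.Tensor:
    out = model(**tok(text, return_tensors="pt"))
    head = model.get_output_embeddings()           # LM head applied to both layers
    late = head(out.hidden_states[LATE][0, -1])    # logits from the final layer
    early = head(out.hidden_states[EARLY][0, -1])  # logits from an early layer
    # emphasize tokens whose probability grows as the representation matures
    return torch.log_softmax(late, dim=-1) - torch.log_softmax(early, dim=-1)

next_id = contrastive_next_token_logits("The capital of France is").argmax().item()
print(tok.decode([next_id]))
```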
arXiv Detail & Related papers (2023-09-07T17:45:31Z)