Related papers: Truth is Universal: Robust Detection of Lies in LLMs

Truth is Universal: Robust Detection of Lies in LLMs

URL: http://arxiv.org/abs/2407.12831v1
Date: Wed, 3 Jul 2024 13:01:54 GMT
Title: Truth is Universal: Robust Detection of Lies in LLMs
Authors: Lennart Bürger, Fred A. Hamprecht, Boaz Nadler,
Abstract summary: Large Language Models (LLMs) have revolutionised natural language processing. LLMs are capable of "lying", knowingly outputting false statements. We develop a robust method to detect when an LLM is lying.
Score: 18.13311575803723
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities. In particular, LLMs are capable of "lying", knowingly outputting false statements. Hence, it is of interest and importance to develop methods to detect when LLMs lie. Indeed, several authors trained classifiers to detect LLM lies based on their internal model activations. However, other researchers showed that these classifiers may fail to generalise, for example to negated statements. In this work, we aim to develop a robust method to detect when an LLM is lying. To this end, we make the following key contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, distinguishing simple true and false statements with 94% accuracy and detecting more complex real-world lies with 95% accuracy.

Related papers

The Trilemma of Truth in Large Language Models [1.62933895796838]
We examine two common methods for probing the veracity of large language models (LLMs)<n>We introduce sAwMIL, a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither.<n>We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets.
arXiv Detail & Related papers (2025-06-30T14:49:28Z)
Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs [48.202202256201815]
Factual hallucinations are a major challenge for Large Language Models (LLMs)<n>They undermine reliability and user trust by generating inaccurate or fabricated content.<n>Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness.
arXiv Detail & Related papers (2025-05-22T11:00:53Z)
"I know myself better, but not really greatly": Using LLMs to Detect and Explain LLM-Generated Texts [10.454446545249096]
Large language models (LLMs) have demonstrated impressive capabilities in generating human-like texts. This paper explores the detection and explanation capabilities of LLM-based detectors of human-generated texts.
arXiv Detail & Related papers (2025-02-18T11:00:28Z)
AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents [27.10147264744531]
We study how language agents navigate scenarios with utility-truthfulness conflicts in a multi-turn interactive setting. We develop a truthfulness detector inspired by psychological literature to assess the agents' responses. Our experiment demonstrates that all models are truthful less than 50% of the time, although truthfulness and goal achievement (utility) rates vary across models.
arXiv Detail & Related papers (2024-09-13T17:41:12Z)
A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation [72.93327642336078]
We propose Belief Tree Propagation (BTProp), a probabilistic framework for hallucination detection. BTProp introduces a belief tree of logically related statements by decomposing a parent statement into child statements. Our method improves baselines by 3%-9% (evaluated by AUROC and AUC-PR) on multiple hallucination detection benchmarks.
arXiv Detail & Related papers (2024-06-11T05:21:37Z)
Potential and Limitations of LLMs in Capturing Structured Semantics: A Case Study on SRL [78.80673954827773]
Large Language Models (LLMs) play a crucial role in capturing structured semantics to enhance language understanding, improve interpretability, and reduce bias. We propose using Semantic Role Labeling (SRL) as a fundamental task to explore LLMs' ability to extract structured semantics. We find interesting potential: LLMs can indeed capture semantic structures, and scaling-up doesn't always mirror potential. We are surprised to discover that significant overlap in the errors is made by both LLMs and untrained humans, accounting for almost 30% of all errors.
arXiv Detail & Related papers (2024-05-10T11:44:05Z)
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs) We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method [36.24876571343749]
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks. Recent literature reveals that LLMs generate nonfactual responses intermittently. We propose a novel self-detection method to detect which questions that a LLM does not know that are prone to generate nonfactual results.
arXiv Detail & Related papers (2023-10-27T06:22:14Z)
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs) As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
arXiv Detail & Related papers (2023-10-10T17:54:39Z)
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions [34.53980255211931]
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. Here, we develop a simple lie detector that requires neither access to the LLM's activations nor ground-truth knowledge of the fact in question. Despite its simplicity, this lie detector is highly accurate and surprisingly general.
arXiv Detail & Related papers (2023-09-26T16:07:54Z)
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. Recent works have proposed algorithms to detect LLM-generated text and protect LLMs. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
The Internal State of an LLM Knows When It's Lying [18.886091925252174]
Large Language Models (LLMs) have shown exceptional performance in various tasks. One of their most prominent drawbacks is generating inaccurate or false information with a confident tone. We provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements.
arXiv Detail & Related papers (2023-04-26T02:49:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.