Liars' Bench: Evaluating Lie Detectors for Language Models
- URL: http://arxiv.org/abs/2511.16035v1
- Date: Thu, 20 Nov 2025 04:29:33 GMT
- Title: Liars' Bench: Evaluating Lie Detectors for Language Models
- Authors: Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks,
- Abstract summary: We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by open-weight models. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie.
- Score: 3.227579417498381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.
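The abstract describes scoring black- and white-box lie detection techniques on labeled transcripts. A purely illustrative sketch of how a white-box linear-probe detector could be evaluated by AUROC follows; the activation extractor and data fields below are placeholder assumptions, not the benchmark's actual API.
```python
# Illustrative only: extract_activations is a placeholder, and the data fields
# ("text", "is_lie") are assumptions rather than the benchmark's actual schema.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def extract_activations(transcripts):
    """Placeholder: return one residual-stream vector per transcript,
    e.g. the hidden state at the final token of the model's response."""
    return np.random.randn(len(transcripts), 4096)  # stand-in features

def evaluate_probe(train_set, test_set):
    X_tr, y_tr = extract_activations(train_set["text"]), np.array(train_set["is_lie"])
    X_te, y_te = extract_activations(test_set["text"]), np.array(test_set["is_lie"])
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = probe.predict_proba(X_te)[:, 1]   # probability the response is a lie
    return roc_auc_score(y_te, scores)         # detector quality as AUROC
```
With real activations in place of the stand-in features, the same loop works for any split of lies versus honest responses, provided both labels appear in the test split.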
Related papers
- Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation [10.262565099386702]
Two approaches to this problem are honesty elicitation and lie detection. We study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses.
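An illustrative sketch of the "sampling without a chat template" elicitation trick, comparing it against the usual chat-formatted prompt; the model name, question, and decoding settings are placeholders, not from the paper.
```python
# Illustrative only: model name and question are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/censored-chat-model"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

question = "What happened during <sensitive event>?"

# (a) the usual chat-template prompt
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False, add_generation_prompt=True,
)
# (b) a raw completion prompt that bypasses the chat template
raw_prompt = f"Q: {question}\nA:"

for prompt in (chat_prompt, raw_prompt):
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    print(tok.decode(new_tokens, skip_special_tokens=True))
```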
arXiv Detail & Related papers (2026-03-05T18:58:14Z)
- RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns [50.401907401444404]
Reliably detecting text generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. We propose RepreGuard, an efficient statistics-based detection method. Experimental results show that RepreGuard outperforms all baselines, with an average AUROC of 94.92% in both in-distribution (ID) and out-of-distribution (OOD) scenarios.
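A generic sketch of representation-based detection in this spirit; this is not RepreGuard's actual statistic, and the mean-difference direction and threshold below are illustrative assumptions.
```python
# Illustrative only: a generic mean-difference direction, not RepreGuard's statistic.
import numpy as np

def detection_direction(H_llm, H_human):
    """H_*: (n_texts, d) hidden-state summaries (e.g. mean-pooled) from a surrogate model."""
    return H_llm.mean(axis=0) - H_human.mean(axis=0)

def llm_score(h, direction):
    return float(h @ direction)  # larger projection => more LLM-like

def is_llm_generated(h, direction, threshold):
    return llm_score(h, direction) > threshold  # threshold chosen on held-out data
```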
arXiv Detail & Related papers (2025-08-18T17:59:15Z)
- When lies are mostly truthful: automated verbal deception detection for embedded lies [0.3867363075280544]
We collected a novel dataset of 2,088 truthful and deceptive statements with annotated embedded lies. We show that a fine-tuned language model (Llama-3-8B) can classify truthful statements and those containing embedded lies with 64% accuracy.
arXiv Detail & Related papers (2025-01-13T11:16:05Z)
- Truth is Universal: Robust Detection of Lies in LLMs [18.13311575803723]
Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities.
In this work, we aim to develop a robust method to detect when an LLM is lying.
We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated.
This finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B and LLaMA3-8B.
Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection.
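A rough sketch of the idea (illustrative code, not the authors' implementation): estimate a general truth direction and a polarity-sensitive direction from class-mean differences of affirmative and negated statements, then score new statements by projection.
```python
# Illustrative only: labels are boolean arrays (True = factually true statement);
# acts_pos / acts_neg hold activations of affirmative / negated statements.
import numpy as np

def fit_truth_directions(acts_pos, labels_pos, acts_neg, labels_neg):
    d_pos = acts_pos[labels_pos].mean(0) - acts_pos[~labels_pos].mean(0)
    d_neg = acts_neg[labels_neg].mean(0) - acts_neg[~labels_neg].mean(0)
    t_general = (d_pos + d_neg) / 2    # direction shared across polarities
    t_polarity = (d_pos - d_neg) / 2   # direction that flips with negation
    return t_general, t_polarity

def truth_score(act, t_general):
    return float(act @ t_general)      # sign predicts true vs. false
```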
arXiv Detail & Related papers (2024-07-03T13:01:54Z)
- Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification [116.77055746066375]
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output.
We propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification.
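One simple token-level uncertainty signal such a pipeline can build on is the per-token log-probability under the generating model; the sketch below is illustrative and is not the paper's pipeline.
```python
# Illustrative only: per-token log-probabilities as a crude uncertainty signal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_logprobs(model, tok, text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[:, :-1]                 # predict token t+1 from its prefix
    logp = torch.log_softmax(logits, dim=-1)
    chosen = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    tokens = tok.convert_ids_to_tokens(ids[0, 1:].tolist())
    return list(zip(tokens, chosen[0].tolist()))

def uncertain_tokens(pairs, threshold=-4.0):
    """Tokens the model assigned low probability to; candidates for fact-checking."""
    return [t for t, lp in pairs if lp < threshold]
```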
arXiv Detail & Related papers (2024-03-07T17:44:17Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
The capabilities of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
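A hedged sketch of the general recipe used in this line of work, automatically ranked preference pairs plus direct preference optimization (DPO); the scoring hook and loss wiring are illustrative assumptions, not the paper's exact implementation.
```python
# Illustrative only: factuality_score and sample_fn are placeholder hooks.
import torch.nn.functional as F

def build_preference_pairs(prompts, sample_fn, factuality_score):
    """Sample two responses per prompt; the more factual one becomes the 'chosen' response."""
    pairs = []
    for p in prompts:
        a, b = sample_fn(p), sample_fn(p)
        if factuality_score(p, a) >= factuality_score(p, b):
            pairs.append((p, a, b))
        else:
            pairs.append((p, b, a))
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on summed response log-probs under policy and reference models."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```
Because the preference labels come from an automated scorer rather than annotators, the whole loop runs without human labeling, which is the point of the abstract's claim.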
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions [34.53980255211931]
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense.
Here, we develop a simple lie detector that requires neither access to the LLM's activations nor ground-truth knowledge of the fact in question.
Despite its simplicity, this lie detector is highly accurate and surprisingly general.
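A minimal sketch of the recipe described above (the probe questions and the ask() helper are placeholders): ask fixed, unrelated follow-up questions after the suspected lie and train a simple classifier on the yes/no answer pattern.
```python
# Illustrative only: PROBES and ask() are placeholders for the elicitation questions
# and for the call that continues the conversation with the model under test.
import numpy as np
from sklearn.linear_model import LogisticRegression

PROBES = [
    "Is the sky usually blue on a clear day? Answer yes or no.",
    "Can fish ride bicycles? Answer yes or no.",
    # ... more fixed, unrelated elicitation questions
]

def answer_features(ask, transcript):
    """ask(transcript, question) -> the model's answer string after the suspected lie."""
    return np.array([1.0 if ask(transcript, q).strip().lower().startswith("yes") else 0.0
                     for q in PROBES])

def fit_detector(ask, transcripts, labels):
    """Train on transcripts with known lie / honest labels, then reuse on new ones."""
    X = np.stack([answer_features(ask, t) for t in transcripts])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```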
arXiv Detail & Related papers (2023-09-26T16:07:54Z)
- A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [76.34411067299331]
Large language models often tend to 'hallucinate', which critically hampers their reliability.
We propose an approach that actively detects and mitigates hallucinations during the generation process.
We show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average.
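An illustrative sketch of the active detect-then-mitigate loop; every hook below (generation, confidence estimation, validation, repair) is a placeholder for whatever model and retrieval tools are used, not the paper's code.
```python
# Illustrative only: every hook here is a placeholder for a real model or retrieval tool.
def generate_with_validation(generate_sentence, key_token_confidences,
                             validate, repair, prompt,
                             max_sentences=10, threshold=0.5):
    text = prompt
    for _ in range(max_sentences):
        sentence = generate_sentence(text)                  # next sentence plus token probabilities
        if not sentence:
            break
        low_conf = [t for t, p in key_token_confidences(sentence) if p < threshold]
        if low_conf and not validate(sentence, low_conf):   # check flagged concepts externally
            sentence = repair(sentence, low_conf)           # rewrite before generation continues
        text += " " + sentence
    return text
```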
arXiv Detail & Related papers (2023-07-08T14:25:57Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
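A hedged sketch of the unsupervised objective from this line of work, contrast-consistent search: a probe is trained so that a statement and its negation receive complementary, confident probabilities, without any truth labels.
```python
# Illustrative only: h_pos / h_neg are activations of each statement and its negation.
import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.lin = nn.Linear(d_model, 1)

    def forward(self, h):
        return torch.sigmoid(self.lin(h)).squeeze(-1)

def ccs_loss(probe, h_pos, h_neg):
    p_pos, p_neg = probe(h_pos), probe(h_neg)
    consistency = ((p_pos + p_neg - 1) ** 2).mean()         # the two probabilities should sum to 1
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()  # discourage the degenerate 0.5 answer
    return consistency + confidence

# Train with any optimizer over normalized activation pairs; no truth labels are needed.
```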
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Machine Learning based Lie Detector applied to a Collected and Annotated Dataset [1.3007851628964147]
We have collected a dataset that contains annotated images and 3D information of different participants' faces during a card game that incentivises lying.
Using our collected dataset, we evaluated several types of machine learning based lie detectors in generalized, personalized, and cross-lie experiments.
In these experiments, we showed the superiority of deep learning based models in recognizing lies, with a best accuracy of 57% for the generalized task and 63% when dealing with a single participant.
arXiv Detail & Related papers (2021-04-26T04:48:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.