How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
- URL: http://arxiv.org/abs/2309.15840v1
- Date: Tue, 26 Sep 2023 16:07:54 GMT
- Title: How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
- Authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner
- Abstract summary: Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense.
Here, we develop a simple lie detector that requires neither access to the LLM's activations nor ground-truth knowledge of the fact in question.
Despite its simplicity, this lie detector is highly accurate and surprisingly general.
- Score: 34.53980255211931
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can "lie", which we define as outputting false
statements despite "knowing" the truth in a demonstrable sense. LLMs might
"lie", for example, when instructed to output misinformation. Here, we develop
a simple lie detector that requires neither access to the LLM's activations
(black-box) nor ground-truth knowledge of the fact in question. The detector
works by asking a predefined set of unrelated follow-up questions after a
suspected lie, and feeding the LLM's yes/no answers into a logistic regression
classifier. Despite its simplicity, this lie detector is highly accurate and
surprisingly general. When trained on examples from a single setting --
prompting GPT-3.5 to lie about factual questions -- the detector generalises
out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie,
(3) sycophantic lies, and (4) lies emerging in real-life scenarios such as
sales. These results indicate that LLMs have distinctive lie-related
behavioural patterns, consistent across architectures and contexts, which could
enable general-purpose lie detection.
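
A minimal sketch of the pipeline described in the abstract: pose a fixed set of unrelated follow-up questions after the suspected lie, binarise the model's yes/no answers, and feed them to a logistic-regression classifier. The elicitation questions and the `ask_model` callable below are illustrative placeholders (not the paper's released question set or code), and the training data is synthetic, purely to show the shape of the method.

```python
# Sketch of black-box lie detection via unrelated follow-up questions,
# assuming a hypothetical ask_model(prompt) -> str interface.
from typing import Callable
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder elicitation questions; the paper uses its own predefined set.
ELICITATION_QUESTIONS = [
    "Is the sky sometimes green? Answer yes or no.",
    "Are you confident in your previous answer? Answer yes or no.",
    "Is 2 + 2 equal to 4? Answer yes or no.",
]

def elicit_features(conversation: str, ask_model: Callable[[str], str]) -> np.ndarray:
    """Append each follow-up question to the conversation and encode the
    model's reply as 1.0 for 'yes' and 0.0 otherwise."""
    answers = []
    for question in ELICITATION_QUESTIONS:
        reply = ask_model(conversation + "\n" + question).strip().lower()
        answers.append(1.0 if reply.startswith("yes") else 0.0)
    return np.array(answers)

# Toy end-to-end run on synthetic yes/no answer patterns (no real LLM calls).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, len(ELICITATION_QUESTIONS))).astype(float)
y = (X.sum(axis=1) >= 2).astype(int)  # synthetic "lie" labels for illustration

detector = LogisticRegression().fit(X, y)
print("P(lie) for first answer pattern:", detector.predict_proba(X[:1])[0, 1])
```

In an actual application, the classifier would be fitted on answer patterns collected after known truthful and untruthful model statements, then applied to new conversations with no access to activations or to the ground truth of the suspected claim.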
Related papers
- Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically.
This lack of question awareness in LLMs leads to two failure modes: (1) answering non-open-ended questions too casually, or (2) answering open-ended questions in an overly bland way.
arXiv Detail & Related papers (2024-10-01T06:07:00Z)
- Truth is Universal: Robust Detection of Lies in LLMs [18.13311575803723]
Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities.
In this work, we aim to develop a robust method to detect when an LLM is lying.
We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated.
This finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B and LLaMA3-8B.
Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection.
arXiv Detail & Related papers (2024-07-03T13:01:54Z)
- Scaling Laws for Fact Memorization of Large Language Models [67.94080978627363]
We analyze the scaling laws for Large Language Models' fact knowledge and their behaviors of memorizing different types of facts.
We find that LLMs' fact-knowledge capacity scales linearly with model size and follows a negative-exponential law with respect to training epochs.
Our findings reveal the capacity and characteristics of LLMs' fact knowledge learning, which provide directions for LLMs' fact knowledge augmentation.
arXiv Detail & Related papers (2024-06-22T03:32:09Z)
- A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation [72.93327642336078]
We propose Belief Tree Propagation (BTProp), a probabilistic framework for hallucination detection.
BTProp introduces a belief tree of logically related statements by decomposing a parent statement into child statements.
Our method improves baselines by 3%-9% (evaluated by AUROC and AUC-PR) on multiple hallucination detection benchmarks.
arXiv Detail & Related papers (2024-06-11T05:21:37Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to the extensive knowledge they acquire during pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching [0.0]
Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty.
In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie.
We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs.
arXiv Detail & Related papers (2023-11-25T22:41:23Z)
- Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method [36.24876571343749]
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks.
Recent literature reveals that LLMs generate nonfactual responses intermittently.
We propose a novel self-detection method to identify the questions that an LLM does not know and is therefore prone to answer with nonfactual results.
arXiv Detail & Related papers (2023-10-27T06:22:14Z)
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods.
Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations; a minimal probe sketch appears after this list.
We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
arXiv Detail & Related papers (2023-10-10T17:54:39Z)
- Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)
- The Internal State of an LLM Knows When It's Lying [18.886091925252174]
Large Language Models (LLMs) have shown exceptional performance in various tasks.
One of their most prominent drawbacks is generating inaccurate or false information with a confident tone.
We provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements.
arXiv Detail & Related papers (2023-04-26T02:49:38Z)
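
Several of the related papers above (e.g. The Geometry of Truth, Truth is Universal, The Internal State of an LLM Knows When It's Lying) rely on linear probes trained on internal activations. The sketch below, promised in the list, illustrates that idea only; the activation vectors are synthetic stand-ins, since the real methods require white-box access to a model's hidden states, which this listing does not reproduce.

```python
# Minimal, illustrative linear truth probe on synthetic "activations".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 64, 400

# Synthetic activations: true and false statements offset along one direction,
# mimicking the linear structure reported in the probe-based papers.
truth_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n)  # 1 = true statement, 0 = false
acts = rng.normal(size=(n, d_model)) + np.outer(2 * labels - 1, truth_direction)

# Linear probe: logistic regression on the raw activation vectors.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("Probe accuracy on synthetic activations:", probe.score(acts, labels))
```

In the cited work, the probe inputs would be residual-stream or hidden-layer activations collected from specific layers of an open-weight model rather than synthetic vectors, and evaluation would be on held-out true/false statements.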