Localizing Lying in Llama: Understanding Instructed Dishonesty on
True-False Questions Through Prompting, Probing, and Patching
- URL: http://arxiv.org/abs/2311.15131v1
- Date: Sat, 25 Nov 2023 22:41:23 GMT
- Title: Localizing Lying in Llama: Understanding Instructed Dishonesty on
True-False Questions Through Prompting, Probing, and Patching
- Authors: James Campbell, Richard Ren, Phillip Guo
- Abstract summary: Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty.
In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie.
We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) demonstrate significant knowledge through their
outputs, though it is often unclear whether false outputs are due to a lack of
knowledge or dishonesty. In this paper, we investigate instructed dishonesty,
wherein we explicitly prompt LLaMA-2-70b-chat to lie. We perform prompt
engineering to find which prompts best induce lying behavior, and then use
mechanistic interpretability approaches to localize where in the network this
behavior occurs. Using linear probing and activation patching, we localize five
layers that appear especially important for lying. We then find just 46
attention heads within these layers that enable us to causally intervene such
that the lying model instead answers honestly. We show that these interventions
work robustly across many prompts and dataset splits. Overall, our work
contributes a greater understanding of dishonesty in LLMs so that we may hope
to prevent it.
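
As an illustration only, here is a minimal Python sketch of the two localization tools the abstract names, linear probing and activation patching, applied to a generic Hugging Face causal LM. The model name, the honest/lying prompt pair, the toy statements, and the patched layer index are all assumptions made for the sketch, not the authors' actual setup (which targets LLaMA-2-70b-chat and specific layers and attention heads).

```python
# Minimal sketch (not the paper's exact protocol): per-layer linear probes on
# residual-stream activations under honest vs. lying prompts, plus a simple
# activation patch that copies one layer's honest activation into a lying run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # stand-in; the paper uses LLaMA-2-70b-chat
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

HONEST = "Answer truthfully. Is the following statement true or false? "
LYING = "Answer as if you were lying. Is the following statement true or false? "
statements = ["The Earth orbits the Sun.", "Paris is the capital of Italy."]  # toy data

def last_token_activations(prompt: str):
    """Return the final-token hidden state at every layer (one vector per layer)."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return [h[0, -1].float().cpu().numpy() for h in out.hidden_states]

# --- Linear probing: which layers separate honest runs from lying runs? ---
X_by_layer, labels = {}, []
for s in statements:
    for prefix, label in [(HONEST, 0), (LYING, 1)]:
        for layer, act in enumerate(last_token_activations(prefix + s)):
            X_by_layer.setdefault(layer, []).append(act)
        labels.append(label)
for layer, X in X_by_layer.items():
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer:02d}  probe train accuracy: {probe.score(X, labels):.2f}")

# --- Activation patching: write an honest-run activation into the lying run ---
PATCH_LAYER = 20  # illustrative index, not one of the layers identified in the paper
cache = {}

def save_hook(module, inputs, output):
    cache["honest"] = output[0].detach()  # decoder layers return (hidden_states, ...)

def patch_hook(module, inputs, output):
    output[0][:, -1, :] = cache["honest"][:, -1, :]  # overwrite final-token activation
    return output

layer_mod = model.model.layers[PATCH_LAYER]

handle = layer_mod.register_forward_hook(save_hook)
with torch.no_grad():
    model(**tok(HONEST + statements[0], return_tensors="pt").to(model.device))
handle.remove()

handle = layer_mod.register_forward_hook(patch_hook)
with torch.no_grad():
    out = model(**tok(LYING + statements[0], return_tensors="pt").to(model.device))
handle.remove()
print("patched next token:", tok.decode([out.logits[0, -1].argmax().item()]))
```

The same hook mechanism extends naturally from whole layers to individual attention heads, which is the granularity at which the paper reports its causal interventions.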
Related papers
- Large Language Models as Misleading Assistants in Conversation [8.557086720583802]
We investigate the ability of Large Language Models (LLMs) to be deceptive in the context of providing assistance on a reading comprehension task.
We compare outcomes of (1) when the model is prompted to provide truthful assistance, (2) when it is prompted to be subtly misleading, and (3) when it is prompted to argue for an incorrect answer.
arXiv Detail & Related papers (2024-07-16T14:45:22Z)
- Truth is Universal: Robust Detection of Lies in LLMs [18.13311575803723]
Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities.
In this work, we aim to develop a robust method to detect when an LLM is lying.
We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated.
This finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B and LLaMA3-8B.
Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection.
arXiv Detail & Related papers (2024-07-03T13:01:54Z)
- Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals [53.273592543786705]
Large language models (LLMs) have achieved great success, but their occasional content fabrication, or hallucination, limits their practical application.
We propose CoKE, which first probes LLMs' knowledge boundary via internal confidence given a set of questions, and then leverages the probing results to elicit the expression of the knowledge boundary.
arXiv Detail & Related papers (2024-06-16T10:07:20Z)
- Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs [28.58726732808416]
We employ the Greedy Coordinate Gradient to craft prompts that compel large language models to generate coherent responses from seemingly nonsensical inputs.
We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima.
Notably, we find that guiding the model to generate harmful texts is no more difficult than guiding it to generate benign texts, suggesting a lack of alignment for out-of-distribution prompts.
arXiv Detail & Related papers (2024-04-26T02:29:26Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination? [73.454943870226]
This work studies a specific type of hallucination induced by semantic associations.
To quantify this phenomenon, we propose a novel probing method and benchmark called EureQA.
arXiv Detail & Related papers (2023-11-16T09:27:36Z)
- The Perils & Promises of Fact-checking with Large Language Models [55.869584426820715]
Large Language Models (LLMs) are increasingly trusted to write academic papers, lawsuits, and news articles.
We evaluate the use of LLM agents in fact-checking by having them phrase queries, retrieve contextual data, and make decisions.
Our results show the enhanced prowess of LLMs when equipped with contextual information.
While LLMs show promise in fact-checking, caution is essential due to inconsistent accuracy.
arXiv Detail & Related papers (2023-10-20T14:49:47Z)
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [6.732432949368421]
Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods.
Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations.
We present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements.
arXiv Detail & Related papers (2023-10-10T17:54:39Z)
- Do Large Language Models Know about Facts? [60.501902866946]
Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks.
We aim to evaluate the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio.
Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages.
arXiv Detail & Related papers (2023-10-08T14:26:55Z)
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions [34.53980255211931]
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense.
Here, we develop a simple lie detector that requires neither access to the LLM's activations nor ground-truth knowledge of the fact in question.
Despite its simplicity, this lie detector is highly accurate and surprisingly general.
arXiv Detail & Related papers (2023-09-26T16:07:54Z)
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
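
The DoLa entry directly above contrasts the next-token distribution from the final ("mature") layer with one read out from an earlier ("premature") layer. The sketch below shows that contrast for a single next-token prediction with a generic Hugging Face LLaMA-style model; the model name, the fixed premature layer, and the simplified plausibility cutoff are assumptions, whereas the actual method selects the premature layer dynamically.

```python
# Illustrative contrast of final-layer vs. early-layer next-token distributions,
# in the spirit of DoLa; not the authors' implementation.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # stand-in model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def contrasted_next_token(prompt: str, premature_layer: int = 16) -> str:
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
        # Mature distribution: the model's ordinary final-layer logits.
        mature = torch.log_softmax(out.logits[0, -1].float(), dim=-1)
        # Premature distribution: early-exit an intermediate hidden state
        # through the final norm and the LM head.
        early_h = out.hidden_states[premature_layer][0, -1]
        premature = torch.log_softmax(
            model.lm_head(model.model.norm(early_h)).float(), dim=-1
        )
    # Contrast: amplify what the mature layer "knows" beyond the premature layer,
    # restricted to tokens the mature layer already finds reasonably plausible.
    contrast = mature - premature
    plausible = mature >= mature.max() + math.log(0.1)
    contrast[~plausible] = float("-inf")
    return tok.decode([contrast.argmax().item()])

print(contrasted_next_token("The capital of the state of Washington is"))
```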
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.