Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
- URL: http://arxiv.org/abs/2603.05494v1
- Date: Thu, 05 Mar 2026 18:58:14 GMT
- Title: Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
- Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
- Abstract summary: Two approaches to this problem are honesty elicitation and lie detection. We study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses.
- Score: 10.262565099386702
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
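As a concrete illustration of the first elicitation technique named in the abstract, the sketch below contrasts generation with and without the chat template using Hugging Face transformers. The checkpoint name and the question are illustrative stand-ins; the paper's exact prompts are in its released code.

```python
# Minimal sketch: sample from a censored model as a raw language model, without
# its chat template, and compare against the templated assistant persona.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # illustrative; any open-weights causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Chat-template path: the censorship-trained assistant persona answers.
messages = [{"role": "user", "content": "What happened at Tiananmen Square in 1989?"}]
chat_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# No-template path: continue a bare text prefix, bypassing the assistant persona.
raw_prompt = "Q: What happened at Tiananmen Square in 1989?\nA:"
raw_ids = tokenizer(raw_prompt, return_tensors="pt").input_ids.to(model.device)

for ids in (chat_ids, raw_ids):
    out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.7)
    print(tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True), "\n---")
```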
Related papers
- Liars' Bench: Evaluating Lie Detectors for Language Models [3.227579417498381]
We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by open-weight models. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie.
arXiv Detail & Related papers (2025-11-20T04:29:33Z)
- Multi-Modal Fact-Verification Framework for Reducing Hallucinations in Large Language Models [0.0]
Large Language Models generate false information that sounds plausible. This hallucination problem has become a major barrier to deploying these models in real-world applications. We develop a fact verification framework that catches and corrects these errors in real time.
arXiv Detail & Related papers (2025-10-26T16:58:54Z)
- But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors [0.0]
Judge Using Safety-Steered Alternatives (JUSSA) is a framework that employs steering vectors during inference to generate more honest alternatives. We evaluate JUSSA on sycophancy detection and introduce a new manipulation dataset covering multiple types of manipulation. Our work opens new directions for scalable model auditing as systems become increasingly sophisticated.
arXiv Detail & Related papers (2025-05-23T11:34:02Z)
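The steering-vector mechanism JUSSA relies on can be sketched as an inference-time hook that shifts one layer's hidden states along a precomputed "honesty" direction. The layer index, scale, vector source, and the `model.model.layers` path (Llama/Qwen-style models) are all assumptions here, not the paper's exact recipe.

```python
# Hedged sketch of inference-time activation steering: add a fixed direction to
# the residual stream at one decoder layer during generation.
import torch

def add_steering_hook(model, layer_idx: int, vector: torch.Tensor, scale: float = 4.0):
    """Register a hook that shifts one decoder layer's hidden states by scale * vector."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(device=hidden.device, dtype=hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage (hypothetical): `honesty_vec` might be a difference of mean activations
# between honest and dishonest prompts.
# handle = add_steering_hook(model, layer_idx=15, vector=honesty_vec)
# steered = model.generate(input_ids, max_new_tokens=200)
# handle.remove()  # detach the hook when done
```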
- LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models [69.68379406317682]
We introduce a listener-aware finetuning method (LACIE) to calibrate implicit and explicit confidence markers.
We show that LACIE models the listener, considering not only whether an answer is right but also whether it will be accepted by the listener.
We find that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers.
arXiv Detail & Related papers (2024-05-31T17:16:38Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching [0.0]
Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty.
In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie.
We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs.
arXiv Detail & Related papers (2023-11-25T22:41:23Z)
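The "patching" part of the toolkit above can be sketched as activation patching: cache one layer's activations from an honest run and splice them into a lying run to see which layers flip the output. The sketch assumes a Llama-style Hugging Face model and prompts of equal token length; it is not the paper's exact setup.

```python
# Hedged sketch of per-layer activation patching to localize lying behavior.
import torch

@torch.no_grad()
def patch_layer(model, honest_ids, lying_ids, layer_idx: int):
    """Splice one layer's honest-run activations into a lying run.

    Assumes both prompts tokenize to the same length so shapes match.
    """
    cache = {}

    def save(module, inputs, output):
        # Record the honest run's hidden states for this layer.
        cache["h"] = output[0] if isinstance(output, tuple) else output

    def patch(module, inputs, output):
        # Overwrite the lying run's hidden states with the cached honest ones.
        if isinstance(output, tuple):
            return (cache["h"],) + output[1:]
        return cache["h"]

    layer = model.model.layers[layer_idx]
    handle = layer.register_forward_hook(save)
    model(honest_ids)
    handle.remove()
    handle = layer.register_forward_hook(patch)
    logits = model(lying_ids).logits
    handle.remove()
    # Compare e.g. the " True"/" False" token logits against an unpatched run.
    return logits
```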
- Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge [35.067234242461545]
Large language models (LLMs) are expected to express uncertainty in situations where they lack sufficient parametric knowledge to generate reasonable responses.
This work aims to systematically investigate LLMs' behaviors in such situations, emphasizing the trade-off between honesty and helpfulness.
arXiv Detail & Related papers (2023-11-16T10:02:40Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have seen widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
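A minimal sketch of FreshPrompt-style prompt assembly follows, under the assumption that retrieved search snippets arrive as source/date/text records and are ordered with the most recent evidence nearest the question; the paper's exact template will differ.

```python
# Hedged sketch: prepend retrieved, date-stamped search evidence to the question.
from datetime import date

def fresh_prompt(question: str, snippets: list[dict]) -> str:
    """snippets: [{"source": ..., "date": ..., "text": ...}, ...], oldest first."""
    lines = [
        f"Today's date: {date.today().isoformat()}.",
        "Answer the question using the search results below; prefer recent evidence.",
        "",
    ]
    for s in snippets:
        lines.append(f"[{s['source']} | {s['date']}] {s['text']}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```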
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions [34.53980255211931]
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense.
Here, we develop a simple lie detector that requires neither access to the LLM's activations nor ground-truth knowledge of the fact in question.
Despite its simplicity, this lie detector is highly accurate and surprisingly general.
arXiv Detail & Related papers (2023-09-26T16:07:54Z)
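The detector above can be sketched as follows: after the suspect answer, ask a fixed bank of unrelated yes/no follow-up questions, binarize the answers, and train a logistic-regression classifier on dialogues with known honest/lying labels. The follow-up questions and the `ask_model` interface here are placeholders, not the paper's elicitation set.

```python
# Hedged sketch of a black-box lie detector built from unrelated follow-ups.
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [
    "Is the sky blue?",        # placeholder elicitation questions
    "Does 2 + 2 equal 5?",
    "Can you fly unaided?",
]

def features(ask_model, dialogue: str) -> np.ndarray:
    """ask_model(prompt) -> str is any black-box chat interface (assumed)."""
    answers = [
        ask_model(dialogue + "\nUser: " + q + "\nAssistant:") for q in FOLLOW_UPS
    ]
    return np.array(
        [1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers]
    )

# Train on dialogues with known labels, then score new responses:
# X = np.stack([features(ask_model, d) for d in labeled_dialogues]); y = labels
# detector = LogisticRegression().fit(X, y)
# p_lie = detector.predict_proba(features(ask_model, new_dialogue).reshape(1, -1))[0, 1]
```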
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
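The unsupervised method above (Contrast-Consistent Search) can be sketched as a linear probe trained so that a statement and its negation receive complementary probabilities, plus a confidence term that rules out the degenerate 0.5/0.5 solution. Extracting the hidden states is assumed to have been done separately.

```python
# Hedged sketch of a CCS-style unsupervised truth probe on cached activations.
import torch
import torch.nn as nn

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2     # p(x+) should equal 1 - p(x-)
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage p+ = p- = 0.5
    return (consistency + confidence).mean()

def train_ccs(acts_pos, acts_neg, epochs=1000, lr=1e-3):
    """acts_pos/acts_neg: (n, d) float hidden states for statements / negations.

    (The paper also mean-normalizes each class's activations first; omitted here.)
    """
    probe = nn.Sequential(nn.Linear(acts_pos.shape[1], 1), nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
        loss.backward()
        opt.step()
    return probe
```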
- FaVIQ: FAct Verification from Information-seeking Questions [77.7067957445298]
We construct a large-scale fact verification dataset called FaVIQ using information-seeking questions posed by real users.
Our claims are verified to be natural, contain little lexical bias, and require a complete understanding of the evidence for verification.
arXiv Detail & Related papers (2021-07-05T17:31:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.