When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
- URL: http://arxiv.org/abs/2506.04909v1
- Date: Thu, 05 Jun 2025 11:44:19 GMT
- Title: When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
- Authors: Kai Wang, Yihao Zhang, Meng Sun
- Abstract summary: We study strategic deception in large language models (LLMs). We induce, detect, and control such deception in CoT-enabled LLMs, achieving a 40% success rate in eliciting context-appropriate deception without explicit prompts.
- Score: 9.05950721565821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues in LLMs, which might be explained as a form of hallucination, these models' explicit thought paths enable us to study strategic deception: goal-driven, intentional misinformation in which the reasoning contradicts the output. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) and achieving 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling an honesty-related failure mode specific to reasoning models and providing tools for trustworthy AI alignment.
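The pipeline the abstract describes (extracting a reading vector from contrastive activations, detecting deception by projection, and steering activations along that vector) can be sketched on synthetic data. This is a minimal illustration, not the authors' code: the toy activations, the difference-of-means probe, and names like `extract_deception_vector` and `steer` are assumptions standing in for LAT applied to a real model's hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy hidden-state dimensionality

# Synthetic hidden states: "deceptive" responses are shifted along one
# latent direction, standing in for what LAT would read out of a real model.
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)
honest = rng.normal(size=(200, D))
deceptive = rng.normal(size=(200, D)) + 2.0 * direction

def extract_deception_vector(pos, neg):
    """Difference-of-means reading vector (a simple LAT-style linear probe)."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v = extract_deception_vector(deceptive, honest)

# Threshold at the midpoint of the two classes' mean projections.
threshold = ((deceptive @ v).mean() + (honest @ v).mean()) / 2

def detect(h):
    """Classify activations by their projection onto the deception vector."""
    return h @ v > threshold

accuracy = (detect(deceptive).mean() + (~detect(honest)).mean()) / 2

def steer(h, alpha=4.0):
    """Activation steering: push hidden states along the deception vector."""
    return h + alpha * v
```

In a real model the same arithmetic would be applied to residual-stream activations captured at a chosen layer, with `steer` run inside a forward hook during generation; the toy data here only shows why a single linear direction can both detect and induce the behavior.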
Related papers
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance.
We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state.
We demonstrate that manipulated reasoning alone can inflate the false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z) - Cognitive Inception: Agentic Reasoning against Visual Deceptions by Injecting Skepticism [81.39177645864757]
We propose Inception, a fully reasoning-based agentic framework that conducts authenticity verification by injecting skepticism.
To the best of our knowledge, this is the first fully reasoning-based framework against AIGC visual deceptions.
arXiv Detail & Related papers (2025-11-21T05:13:30Z) - Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare.
Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns.
We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z) - From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models [0.3946915822335988]
Large Language Models (LLMs) are increasingly being deployed as the reasoning engines for agentic AI systems.
They exhibit a critical flaw: a rigid adherence to explicit rules that leads to decisions misaligned with human common sense and intent.
We introduce the Rule-Intent Distinction (RID) Framework, which elicits human-aligned exception handling in LLMs in a zero-shot manner.
arXiv Detail & Related papers (2025-10-14T16:42:52Z) - Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks.
We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z) - Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems [0.0]
We show that language models can generate deceptive explanations that evade detection.
Our agents employ steganographic methods to hide information in seemingly innocent explanations.
All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels.
arXiv Detail & Related papers (2025-04-10T15:07:10Z) - Cognitive Debiasing Large Language Models for Decision-Making [71.2409973056137]
Large language models (LLMs) have shown potential in supporting decision-making applications.
We propose a cognitive debiasing approach: self-adaptive cognitive debiasing (SACD).
Our method follows three sequential steps -- bias determination, bias analysis, and cognitive debiasing -- to iteratively mitigate potential cognitive biases in prompts.
arXiv Detail & Related papers (2025-04-05T11:23:05Z) - "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding [4.008780119020479]
Large language models (LLMs) exhibit strong reasoning abilities, often attributed to few-shot or zero-shot chain-of-thought (CoT) prompting.
We propose a novel decoding strategy that systematically nudges LLMs to continue reasoning, thereby preventing immature reasoning processes.
arXiv Detail & Related papers (2025-03-13T08:46:32Z) - Fooling LLM graders into giving better grades through neural activity guided adversarial prompting [26.164839501935973]
We propose a systematic method to reveal such biases in AI evaluation systems.
Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes.
We demonstrate that this combination can effectively fool large language model graders into assigning much higher grades than humans would.
arXiv Detail & Related papers (2024-12-17T19:08:22Z) - Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making [51.737762570776006]
LLM-ACTR is a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making.
Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations.
Our experiments on novel Design for Manufacturing tasks show both improved task performance as well as improved grounded decision-making capability.
arXiv Detail & Related papers (2024-08-17T11:49:53Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - A Causal Explainable Guardrails for Large Language Models [29.441292837667415]
Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases.
We propose LLMGuardrail, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations.
arXiv Detail & Related papers (2024-05-07T09:55:05Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z) - State Machine of Thoughts: Leveraging Past Reasoning Trajectories for Enhancing Problem Solving [6.198707341858042]
We use a state machine to record experience derived from previous reasoning trajectories.
Within the state machine, states represent decomposed sub-problems, while state transitions reflect the dependencies among sub-problems.
Our proposed State Machine of Thoughts (SMoT) selects optimal sub-solutions and avoids incorrect ones.
arXiv Detail & Related papers (2023-12-29T03:00:04Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge-distillation fine-tuning to assess the performance of the models.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.