The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence
- URL: http://arxiv.org/abs/2601.05478v1
- Date: Fri, 09 Jan 2026 02:28:00 GMT
- Title: The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence
- Authors: Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Zhi Zeng, Min-Yen Kan
- Abstract summary: We introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence. We propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence.
- Score: 49.94160400740222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.
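The abstract describes MisBelief and DIS only at a high level, so the Python sketch below is an illustrative assumption rather than the authors' implementation: it shows one way a generator role and a critic role could iteratively refine misleading evidence over multiple rounds, and how a DIS-style guard could emit an early warning by inferring deceptive intent before the evidence reaches a downstream model. All prompts, role names, and function names here are hypothetical.

```python
# Illustrative sketch only: MisBelief and DIS are described in the abstract at a
# high level; every prompt, role, and function below is an assumption, not the
# authors' implementation.
from typing import Callable

LLM = Callable[[str], str]  # any text-in / text-out model endpoint


def refine_evidence(claim: str, generator: LLM, critic: LLM, rounds: int = 3) -> str:
    """Collaborative multi-round refinement: a generator role drafts misleading
    evidence and a critic role flags the parts that are easiest to falsify, so
    each revision becomes harder to reject."""
    draft = generator(f"Write plausible-sounding evidence supporting: {claim}")
    for _ in range(rounds):
        critique = critic(
            f"List the weakest, most easily falsified parts of this evidence:\n{draft}"
        )
        draft = generator(
            "Revise the evidence so the listed weaknesses no longer apply.\n"
            f"Weaknesses:\n{critique}\nEvidence:\n{draft}"
        )
    return draft


def deceptive_intent_warning(evidence: str, guard: LLM) -> bool:
    """DIS-style early warning: before evidence is used downstream, a guard model
    is asked whether the evidence appears constructed to mislead."""
    verdict = guard(
        "Does the following evidence appear crafted to push the reader toward a "
        f"specific false conclusion? Answer YES or NO.\n{evidence}"
    )
    return verdict.strip().upper().startswith("YES")
```

In this sketch, a downstream system would call `deceptive_intent_warning` on retrieved evidence and treat a positive signal as a cue for more cautious evaluation, matching the abstract's framing of DIS as an early warning rather than a hard filter.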
Related papers
- Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought? [79.86483056611105]
Reasoning LLMs generate step-by-step chains of thought before giving an answer. How robust are these reasoning traces to disruptions that occur within them? We introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps.
arXiv Detail & Related papers (2026-02-07T10:02:58Z) - Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z) - Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions [8.026492468995187]
Small models exhibit extreme compliance, with over 80% of belief changes occurring at the first persuasive turn. Meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. These findings highlight substantial model-dependent limits of current robustness interventions.
arXiv Detail & Related papers (2026-01-20T04:43:55Z) - Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency [78.91846841708586]
We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. We propose Neighbor-Consistency Belief (NCB), a structural measure of belief that evaluates response coherence across a conceptual neighborhood. We also present Structure-Aware Training (SAT), which optimizes a context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%.
arXiv Detail & Related papers (2026-01-09T16:23:21Z) - Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage [18.964773489734547]
As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels.
arXiv Detail & Related papers (2026-01-04T22:50:23Z) - Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z) - Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z) - CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs [15.170312674645535]
CRAVE is a Conflicting Reasoning Approach for explainable claim VErification. It can verify complex claims based on the conflicting rationales reasoned by large language models. CRAVE achieves much better performance than state-of-the-art methods.
arXiv Detail & Related papers (2025-04-21T07:20:31Z) - When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR) [0.46040036610482665]
In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims, some accurate and others forcefully incorrect, and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice; a hedged sketch of one possible formulation appears after this list.
arXiv Detail & Related papers (2025-04-01T02:45:02Z) - Missci: Reconstructing Fallacies in Misrepresented Science [84.32990746227385]
Health-related misinformation on social networks can lead to poor decision-making and real-world dangers.
Missci is a novel argumentation-theoretical model for fallacious reasoning.
We present Missci as a dataset to test the critical reasoning abilities of large language models.
arXiv Detail & Related papers (2024-06-05T12:11:10Z) - FaithLM: Towards Faithful Explanations for Large Language Models [60.45183469474916]
We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of large language models. We show that FaithLM consistently increases faithfulness and produces explanations more aligned with human rationales than strong self-explanation baselines.
arXiv Detail & Related papers (2024-02-07T09:09:14Z)
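The CW-POR entry above defines the metric only in prose. The sketch below is a hedged Python interpretation, assuming each debate yields a binary "override" flag and a judge confidence score in [0, 1]; the exact definition and weighting are given in the cited paper, and this averaging scheme is an assumption.

```python
# Hedged sketch of one plausible CW-POR formulation; the weighting used here
# (average confidence mass on overridden verdicts) is an assumption, not the
# cited paper's exact definition.
def cw_por(judgments: list[tuple[bool, float]]) -> float:
    """judgments: one (overridden, confidence) pair per debate, where
    `overridden` is True when the judge picks the falsehood and `confidence`
    is the judge's confidence in that incorrect choice, in [0, 1]."""
    if not judgments:
        return 0.0
    return sum(conf for overridden, conf in judgments if overridden) / len(judgments)


# Example: two of three judgments overridden -> (0.9 + 0.4) / 3 ≈ 0.433
print(cw_por([(True, 0.9), (False, 0.2), (True, 0.4)]))
```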