Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2512.00349v1
- Date: Sat, 29 Nov 2025 06:39:36 GMT
- Title: Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
- Authors: Sitong Fang, Shiyi Hou, Kaile Wang, Boyuan Chen, Donghai Hong, Jiayi Zhou, Josef Dai, Yaodong Yang, Jiaming Ji
- Abstract summary: We introduce MM-DeceptionBench, the first benchmark explicitly designed to evaluate multimodal deception. MM-DeceptionBench characterizes how models strategically manipulate and mislead through combined visual and textual modalities. We propose debate with images, a novel multi-agent debate monitor framework.
- Score: 25.61834023007555
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind their performance leaps lie more insidious and destructive safety risks, namely deception. Unlike hallucination, which arises from insufficient capability and leads to mistakes, deception represents a deeper threat in which models deliberately mislead users through complex reasoning and insincere responses. As system capabilities advance, deceptive behaviors have spread from textual to multimodal settings, amplifying their potential harm. How, then, can we monitor these covert multimodal deceptive behaviors? Current research remains almost entirely confined to text, leaving the deceptive risks of multimodal large language models unexplored. In this work, we systematically reveal and quantify multimodal deception risks, introducing MM-DeceptionBench, the first benchmark explicitly designed to evaluate multimodal deception. Covering six categories of deception, MM-DeceptionBench characterizes how models strategically manipulate and mislead through combined visual and textual modalities. At the same time, multimodal deception evaluation is almost a blind spot for existing methods: its stealth, compounded by visual-semantic ambiguity and the complexity of cross-modal reasoning, renders action monitoring and chain-of-thought monitoring largely ineffective. To tackle this challenge, we propose debate with images, a novel multi-agent debate monitor framework. By compelling models to ground their claims in visual evidence, this method substantially improves the detectability of deceptive strategies. Experiments show that it consistently increases agreement with human judgments across all tested models, boosting Cohen's kappa by 1.5x and accuracy by 1.25x on GPT-4o.
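As a minimal sketch of how such monitor-human agreement is scored, the snippet below computes Cohen's kappa and accuracy with scikit-learn over invented verdict labels; it illustrates the reported metrics, not the paper's evaluation harness.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Illustrative labels only: 1 = "deceptive", 0 = "honest".
human_judgments = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # ground-truth human labels
monitor_verdicts = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # verdicts from the monitor

# Cohen's kappa discounts the agreement expected by chance, which matters
# when one class (e.g., "honest") dominates the samples.
print("accuracy:", accuracy_score(human_judgments, monitor_verdicts))
print("kappa:   ", cohen_kappa_score(human_judgments, monitor_verdicts))
```

On an imbalanced set a monitor can post high raw accuracy while kappa stays low, which is presumably why both statistics are reported.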
Related papers
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
- On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation [22.536817707658816]
A single perturbation can hijack the whole decision chain. Semantic-Aware Universal Perturbations (SAUPs) induce varied outcomes based on the semantics of the inputs. Experiments on three multimodal large language models demonstrate their vulnerability.
arXiv Detail & Related papers (2025-11-25T07:13:13Z)
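Setting aside the semantic-aware part of SAUPs, the core mechanic of a universal perturbation, a single delta optimized to derail many inputs at once, can be sketched as a standard projected-gradient loop. The model, data loader, and loss below are placeholders, and this generic sketch is not the SAUP construction itself.

```python
import torch

def universal_perturbation(model, loader, loss_fn, eps=8 / 255, steps=50, lr=1e-2):
    """Optimize one perturbation shared across all inputs (generic sketch)."""
    delta = torch.zeros(1, 3, 224, 224, requires_grad=True)  # assumes 224x224 RGB
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        for images, targets in loader:
            adv = (images + delta).clamp(0, 1)    # one delta hits every image
            loss = -loss_fn(model(adv), targets)  # ascend the task loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)           # project onto the L-inf ball
    return delta.detach()
```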
- DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios [57.327907850766785]
Characterization of deception across realistic real-world scenarios remains underexplored. We establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different domains. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. We incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics.
arXiv Detail & Related papers (2025-10-17T10:14:26Z)
- The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels [22.497467057872377]
This study is the first systematic investigation of distortions associated with System I and System II reasoning in multimodal contexts. We demonstrate that slower reasoning models, when presented with incomplete or misleading visual inputs, are more likely to fabricate plausible yet false details to support flawed reasoning.
arXiv Detail & Related papers (2025-05-26T16:55:38Z)
- Adversarial Attacks in Multimodal Systems: A Practitioner's Survey [1.4513830934124627]
Multimodal models are trained to understand text, image, video, and audio. Open-source models inherit vulnerabilities of all the modalities, and the adversarial threat amplifies. This paper addresses the gap by surveying adversarial attacks targeting all four modalities. To the best of our knowledge, this survey is the first comprehensive summarization of the threat landscape in the multimodal world.
arXiv Detail & Related papers (2025-05-06T00:41:16Z)
- Robust image classification with multi-modal large language models [4.709926629434273]
Adversarial examples can cause Deep Neural Networks to make incorrect predictions with high confidence. To mitigate these vulnerabilities, adversarial training and detection-based defenses have been proposed to strengthen models in advance. We propose a novel defense, MultiShield, designed to combine and complement these defenses with multi-modal information.
arXiv Detail & Related papers (2024-12-13T18:49:25Z)
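One way to read the MultiShield idea is as a cross-modal consistency check: if a unimodal classifier and a multimodal zero-shot model disagree on an input, treat the input as suspicious. The sketch below implements that disagreement test with an off-the-shelf CLIP model from Hugging Face; it is a simplified stand-in for the paper's defense, not its implementation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def looks_adversarial(image, classifier_label: str, class_names: list[str]) -> bool:
    """Flag the input when CLIP's zero-shot label disagrees with the classifier."""
    prompts = [f"a photo of a {name}" for name in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image       # shape: (1, len(class_names))
    zero_shot_label = class_names[logits.argmax(dim=-1).item()]
    return zero_shot_label != classifier_label         # disagreement => suspicious
```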
- BadCM: Invisible Backdoor Attack Against Cross-Modal Learning [110.37205323355695]
We introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in the cross-modal backdoor.
BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework.
arXiv Detail & Related papers (2024-10-03T03:51:53Z)
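Abstracting away BadCM's invisibility machinery, a cross-modal backdoor reduces to poisoning a small fraction of image-text pairs so that a trigger becomes bound to attacker-chosen text. The sketch below uses a crude visible patch purely for illustration; BadCM's actual triggers are invisible and modality-specific.

```python
import torch

def poison_pairs(images: torch.Tensor, texts: list[str], target_text: str, rate: float = 0.05):
    """Stamp a trigger on a fraction of images and rebind their captions (sketch)."""
    n_poison = int(len(texts) * rate)
    for i in range(n_poison):
        images[i, :, -8:, -8:] = 1.0   # crude visible patch in the corner
        texts[i] = target_text         # trigger now co-occurs with attacker text
    return images, texts
```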
- Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights [50.89022445197919]
We propose a speech-specific risk taxonomy, covering 8 risk categories under hostility (malicious sarcasm and threats), malicious imitation (age, gender, ethnicity), and stereotypical biases (age, gender, ethnicity).
Based on the taxonomy, we create a small-scale dataset for evaluating current LMMs' capability in detecting these categories of risk.
arXiv Detail & Related papers (2024-06-25T10:08:45Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
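Cantor's perception-decision split can be sketched as a two-stage prompting pipeline: first gather visual context, then reason over it. `call_mllm` below is a hypothetical stand-in for any multimodal LLM client, and the prompts are illustrative rather than Cantor's own.

```python
def call_mllm(image, prompt: str) -> str:
    """Hypothetical MLLM client; plug in a real API here."""
    raise NotImplementedError

def perception_then_decision(image, question: str) -> str:
    # Stage 1 (perception): acquire the visual context the question needs.
    context = call_mllm(
        image, f"Describe the visual details needed to answer: {question}"
    )
    # Stage 2 (decision): reason over the acquired context to produce an answer.
    return call_mllm(
        image,
        f"Visual context:\n{context}\n\nUsing only this context, "
        f"answer step by step: {question}",
    )
```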
- What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models [50.97705264224828]
We propose Counterfactual Inception, a novel method that implants counterfactual thinking into Large Multi-modal Models.
We aim for the models to engage with and generate responses that span a wider contextual understanding of the scene.
Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination.
arXiv Detail & Related papers (2024-03-20T11:27:20Z)
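In spirit, the method conditions the model on a counterfactual about a key visual concept before it answers. The wrapper below shows one way such a prompt could look; `call_lmm`, the keyword argument, and the template are assumptions for illustration, not the paper's implementation.

```python
def call_lmm(image, prompt: str) -> str:
    """Hypothetical LMM client; plug in a real API here."""
    raise NotImplementedError

def counterfactual_answer(image, question: str, keyword: str) -> str:
    # Ask the model to entertain the counterfactual before committing to an
    # answer, nudging it to verify the keyword against the actual image.
    prompt = (
        f"First imagine the counterfactual: what if there were no '{keyword}' "
        f"in the image? Re-examine the image with that possibility in mind, "
        f"then answer: {question}"
    )
    return call_lmm(image, prompt)
```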
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Learning Polysemantic Spoof Trace: A Multi-Modal Disentanglement Network for Face Anti-spoofing [34.44061534596512]
This paper presents a multi-modal disentanglement model that learns polysemantic spoof traces in a targeted way for more accurate and robust generic attack detection.
In particular, based on the adversarial learning mechanism, a two-stream disentangling network is designed to estimate spoof patterns from the RGB and depth inputs, respectively.
arXiv Detail & Related papers (2022-12-07T20:23:51Z)
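A bare-bones version of the two-stream design, one encoder per modality estimating a spoof-trace map and a small head fusing the two into a live/spoof score, might look like the PyTorch sketch below. Layer sizes are illustrative, and the paper's adversarial disentanglement objective is omitted.

```python
import torch
import torch.nn as nn

def make_stream(in_ch: int) -> nn.Module:
    """Tiny convolutional encoder producing a per-pixel spoof-trace map."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1),
    )

class TwoStreamSpoofNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_stream = make_stream(3)    # traces estimated from RGB
        self.depth_stream = make_stream(1)  # traces estimated from depth
        self.head = nn.Linear(2, 1)         # fuse pooled trace energies

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        t_rgb = self.rgb_stream(rgb)
        t_depth = self.depth_stream(depth)
        # Pool each trace map to a scalar "energy" and fuse into one score.
        feats = torch.stack([t_rgb.abs().mean(dim=(1, 2, 3)),
                             t_depth.abs().mean(dim=(1, 2, 3))], dim=1)
        return torch.sigmoid(self.head(feats)).squeeze(1)  # spoof probability
```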