When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
- URL: http://arxiv.org/abs/2602.04003v2
- Date: Thu, 12 Feb 2026 14:52:34 GMT
- Title: When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
- Authors: Shutong Fan, Lan Zhang, Xiaoyong Yuan
- Abstract summary: Large Language Models generate fluent natural-language explanations that shape how users perceive and trust AI outputs. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. This is the first systematic security study that treats explanations as an adversarial cognitive channel and quantifies their impact on human trust in AI-assisted decision making.
- Score: 7.170587130743388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most adversarial threats in artificial intelligence target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between correct and incorrect outputs under adversarial explanations. Through this gap, AEAs capture a particularly concerning threat: persuasive explanations that reinforce users' trust in incorrect predictions. To characterize this threat, we conducted a controlled experiment (n = 205), systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI. This is the first systematic security study that treats explanations as an adversarial cognitive channel and quantifies their impact on human trust in AI-assisted decision making.
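The trust miscalibration gap admits a compact illustration. The sketch below is one plausible reading of the metric, as the difference in mean reported trust between correct and incorrect outputs under a given explanation condition; the function name, data layout, and rating scale are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed semantics, not the paper's code): the trust
# miscalibration gap as mean trust on correct outputs minus mean trust
# on incorrect outputs, under one explanation condition.
from statistics import mean

def trust_miscalibration_gap(trials):
    """trials: iterable of (trust_rating, output_correct) pairs, where
    trust_rating is a numeric self-report (e.g., a 1-7 Likert score)
    and output_correct is a bool."""
    trust_correct = [t for t, ok in trials if ok]
    trust_incorrect = [t for t, ok in trials if not ok]
    # Well-calibrated users trust correct outputs far more than incorrect
    # ones (large gap); a successful adversarial explanation props up trust
    # in wrong outputs and collapses the gap toward zero.
    return mean(trust_correct) - mean(trust_incorrect)

# Toy data: adversarial framing lifts trust in incorrect outputs to
# near-benign levels, shrinking the gap.
benign = [(6, True), (7, True), (3, False), (2, False)]
adversarial = [(6, True), (7, True), (6, False), (5, False)]
print(trust_miscalibration_gap(benign))       # 4.0
print(trust_miscalibration_gap(adversarial))  # 1.0
```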
Related papers
- Human-Centered Explainability in AI-Enhanced UI Security Interfaces: Designing Trustworthy Copilots for Cybersecurity Analysts [0.0]
We present a mixed-methods study of explanation design strategies in AI-driven security dashboards. Our findings show that explanation style significantly affects user trust calibration, decision accuracy, and cognitive load. This work advances the design of human-centered AI tools in cybersecurity and carries broader implications for explainability in other high-stakes domains.
arXiv Detail & Related papers (2026-01-30T07:18:20Z)
- Engaging with AI: How Interface Design Shapes Human-AI Collaboration in High-Stakes Decision-Making [8.948482790298645]
We examine how various decision-support mechanisms impact user engagement, trust, and human-AI collaborative task performance. Our findings reveal that mechanisms like AI confidence levels, text explanations, and performance visualizations enhanced human-AI collaborative task performance.
arXiv Detail & Related papers (2025-01-28T02:03:00Z)
- Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
- Deceptive AI systems that give explanations are more convincing than honest AI systems and can amplify belief in misinformation [29.022316418575866]
We examined the impact of deceptive AI-generated explanations on individuals' beliefs.
Our results show that personal factors such as cognitive reflection and trust in AI do not necessarily protect individuals from these effects.
This underscores the importance of teaching logical reasoning and critical thinking skills to identify logically invalid arguments.
arXiv Detail & Related papers (2024-07-31T05:39:07Z)
- A Diachronic Perspective on User Trust in AI under Uncertainty [52.44939679369428]
Modern NLP systems are often uncalibrated, resulting in confidently incorrect predictions that undermine user trust.
We study the evolution of user trust in response to trust-eroding events using a betting game.
arXiv Detail & Related papers (2023-10-20T14:41:46Z)
- The Response Shift Paradigm to Quantify Human Trust in AI Recommendations [6.652641137999891]
Explainability and interpretability, and how much they affect human trust in AI systems, are ultimately problems of human cognition as much as of machine learning.
We developed and validated a general-purpose Human-AI interaction paradigm which quantifies the impact of AI recommendations on human decisions.
Our proof-of-principle paradigm allows one to quantitatively compare the rapidly growing set of XAI/IAI approaches in terms of their effect on the end-user.
arXiv Detail & Related papers (2022-02-16T22:02:09Z)
- Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)
- The Who in XAI: How AI Background Shapes Perceptions of AI Explanations [61.49776160925216]
We conduct a mixed-methods study of how two groups, people with and without an AI background, perceive different types of AI explanations.
We find that (1) both groups showed unwarranted faith in numbers for different reasons and (2) each group found value in different explanations beyond their intended design.
arXiv Detail & Related papers (2021-07-28T17:32:04Z)
- Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI [55.4046755826066]
We discuss a model of trust inspired by, but not identical to, sociology's interpersonal trust (i.e., trust between people).
We incorporate a formalization of 'contractual trust', such that trust between a user and an AI is trust that some implicit or explicit contract will hold.
We discuss how to design trustworthy AI, how to evaluate whether trust has manifested, and whether it is warranted.
arXiv Detail & Related papers (2020-10-15T03:07:23Z)
- Deceptive AI Explanations: Creation and Detection [3.197020142231916]
We investigate how AI models can be used to create and detect deceptive explanations.
As an empirical evaluation, we focus on text classification and alter the explanations generated by GradCAM.
We evaluate the effect of deceptive explanations on users in an experiment with 200 participants.
arXiv Detail & Related papers (2020-01-21T16:41:22Z)
- Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making [53.62514158534574]
We study whether features that reveal case-specific model information can calibrate trust and improve the joint performance of the human and AI.
We show that a confidence score can help calibrate people's trust in an AI model, but trust calibration alone is not sufficient to improve AI-assisted decision making (a minimal calibration sketch follows this list).
arXiv Detail & Related papers (2020-01-07T15:33:48Z)
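Two entries above hinge on calibration: the diachronic-trust study notes that modern NLP systems are often uncalibrated, and the last entry shows confidence scores helping calibrate trust. Below is a minimal, self-contained sketch of expected calibration error (ECE), a standard measure behind such claims; the binning scheme and toy data are illustrative assumptions, not code from either paper.

```python
# Hypothetical sketch of expected calibration error (ECE): bin predictions
# by confidence and average the |confidence - accuracy| mismatch per bin,
# weighted by bin size. High ECE means confidently incorrect predictions.
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: bools, one per prediction."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with confidence 0.0 folded into bin 0.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# A model that claims 0.9 confidence on every item but is right only half
# the time is badly miscalibrated: ECE = |0.9 - 0.5| = 0.4.
print(expected_calibration_error([0.9] * 4, [True, False, True, False]))  # 0.4
```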