ExplainableGuard: Interpretable Adversarial Defense for Large Language Models Using Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2511.13771v1
- Date: Sat, 15 Nov 2025 06:11:07 GMT
- Title: ExplainableGuard: Interpretable Adversarial Defense for Large Language Models Using Chain-of-Thought Reasoning
- Authors: Shaowei Guan, Yu Zhai, Zhengyu Zhang, Yanze Wang, Hin Chi Kwok,
- Abstract summary: Large Language Models (LLMs) are increasingly vulnerable to adversarial attacks that can subtly manipulate their outputs.<n>This paper introduces ExplainableGuard, an interpretable adversarial defense framework leveraging the chain-of-thought (CoT) reasoning capabilities of DeepSeek-Reasoner.<n>Preliminary results on the GLUE Benchmark and IMDB Movie Reviews dataset show promising defense efficacy.
- Score: 11.340599332479705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly vulnerable to adversarial attacks that can subtly manipulate their outputs. While various defense mechanisms have been proposed, many operate as black boxes, lacking transparency in their decision-making. This paper introduces ExplainableGuard, an interpretable adversarial defense framework leveraging the chain-of-thought (CoT) reasoning capabilities of DeepSeek-Reasoner. Our approach not only detects and neutralizes adversarial perturbations in text but also provides step-by-step explanations for each defense action. We demonstrate how tailored CoT prompts guide the LLM to perform a multi-faceted analysis (character, word, structural, and semantic) and generate a purified output along with a human-readable justification. Preliminary results on the GLUE Benchmark and IMDB Movie Reviews dataset show promising defense efficacy. Additionally, a human evaluation study reveals that ExplainableGuard's explanations outperform ablated variants in clarity, specificity, and actionability, with a 72.5% deployability-trust rating, underscoring its potential for more trustworthy LLM deployments.
Related papers
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls.<n>Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy.<n>We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z) - Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement [13.976796671311066]
Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent.<n>We introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference.<n>We then propose FrameShield, an anomaly detector operating on the framing representations, which improves model-agnostic detection.
arXiv Detail & Related papers (2026-02-23T00:11:30Z) - BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z) - Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation [29.8288014123234]
We investigate the vulnerability of intent-aware guardrails and demonstrate that large language models exhibit implicit intent detection capabilities.<n>We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives.<n>Our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses.
arXiv Detail & Related papers (2025-05-24T06:47:32Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process.<n>CPO enhances the model's perception of the safety status of given dialogues.<n>Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning [21.423429565221383]
Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats.<n>We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced textitreasoning capabilities of LLMs for proactive assessment of harmful inputs.
arXiv Detail & Related papers (2025-01-31T14:45:23Z) - When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations [58.27927090394458]
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks.<n>In this paper, we examine backdoor attacks through the novel lens of natural language explanations.<n>Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data.
arXiv Detail & Related papers (2024-11-19T18:11:36Z) - Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models [32.03992137755351]
This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs)
We propose Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique where we guide prompts in a specific manner to quantitatively measure attack effectiveness.
arXiv Detail & Related papers (2024-07-12T14:26:14Z) - Defending Large Language Models against Jailbreak Attacks via Semantic
Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language
Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z) - Proper Network Interpretability Helps Adversarial Robustness in
Classification [91.39031895064223]
We show that with a proper measurement of interpretation, it is difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy.
We develop an interpretability-aware defensive scheme built only on promoting robust interpretation.
We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation.
arXiv Detail & Related papers (2020-06-26T01:31:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.