Explainable but Vulnerable: Adversarial Attacks on XAI Explanation in Cybersecurity Applications
- URL: http://arxiv.org/abs/2510.03623v1
- Date: Sat, 04 Oct 2025 02:07:58 GMT
- Title: Explainable but Vulnerable: Adversarial Attacks on XAI Explanation in Cybersecurity Applications
- Authors: Maraz Mia, Mir Mehedi A. Pritom
- Abstract summary: Explainable Artificial Intelligence (XAI) has given machine learning (ML) researchers the power to scrutinize the decisions of black-box models. XAI methods can themselves fall victim to adversarial attacks that manipulate the expected outcome from the explanation module.
- Score: 0.21485350418225244
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Explainable Artificial Intelligence (XAI) has given machine learning (ML) researchers the power to scrutinize the decisions of black-box models. XAI methods enable looking deep inside a model's behavior, generating explanations along with a perceived sense of trust and transparency. However, the level of trust can vary depending on the specific XAI method. It is evident that XAI methods can themselves fall victim to adversarial attacks that manipulate the expected outcome from the explanation module. Among such attack tactics, fairwashing explanation (FE), manipulation explanation (ME), and backdoor-enabled manipulation (BD) attacks are the notable ones. In this paper, we seek to understand the tactics, techniques, and procedures (TTPs) of these adversarial attacks on explanation alteration and their resulting effect on the model's decisions. We explore a total of six individual attack procedures against post-hoc explanation methods such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and IG (Integrated Gradients), and investigate these adversarial attacks in cybersecurity application scenarios such as phishing, malware, intrusion, and fraudulent website detection. Our experimental study reveals the actual effectiveness of these attacks, underscoring the urgent need to enhance the resiliency of XAI methods and their applications.
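To make the attack surface concrete, below is a minimal, self-contained sketch (not the authors' code) of one fairwashing/explanation-manipulation pattern against a perturbation-based explainer: a scaffolded model answers real queries with a biased rule but routes the off-manifold samples that LIME/SHAP-style explainers generate to an innocuous rule, so the reported feature importances hide what the deployed model actually uses. The toy "phishing" features (has_ip_in_url, url_length), the heuristic perturbation detector, and the simplified LIME-style surrogate are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
FEATURES = ["has_ip_in_url", "url_length"]

# Toy "phishing" data: column 0 is binary, column 1 is a scaled URL length.
X = np.column_stack([rng.integers(0, 2, 1000), rng.normal(size=1000)]).astype(float)

def biased_model(Z):
    """What is actually deployed: flags a URL purely from the binary feature."""
    return np.clip(Z[:, 0], 0.0, 1.0)

def innocuous_model(Z):
    """Shown only to suspected explainer queries: depends on the harmless feature."""
    return (Z[:, 1] > 0).astype(float)

def scaffolded_model(Z):
    """The attack: explainer perturbations make column 0 non-binary, so they are
    easy to detect and can be routed to the innocuous model."""
    is_explainer_query = ~np.isclose(Z[:, 0], np.round(Z[:, 0]))
    out = biased_model(Z)
    out[is_explainer_query] = innocuous_model(Z[is_explainer_query])
    return out

def lime_style_weights(predict_fn, x, n=2000, scale=0.5):
    """Crude LIME-style local surrogate: perturb x, fit a locally weighted
    linear model, and read its coefficients as feature importances."""
    Zs = x + rng.normal(scale=scale, size=(n, x.size))
    w = np.exp(-np.sum((Zs - x) ** 2, axis=1))  # locality kernel
    return Ridge(alpha=1.0).fit(Zs, predict_fn(Zs), sample_weight=w).coef_

x = X[0]  # a real record to explain
for name, model in [("honest biased model", biased_model),
                    ("scaffolded model   ", scaffolded_model)]:
    weights = dict(zip(FEATURES, np.round(lime_style_weights(model, x), 3)))
    print(name, weights)
```

Run as a sketch, the surrogate for the honest model should put its weight on has_ip_in_url, while the scaffolded wrapper shifts the reported importance toward url_length even though its decisions on real inputs are unchanged.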
Related papers
- eXIAA: eXplainable Injections for Adversarial Attack [3.512208543873998]
We show a new black-box model-agnostic adversarial attack on post-hoc explainable Artificial Intelligence (XAI).
The goal of the attack is to modify the original explanations while remaining undetectable to the human eye and keeping the same predicted class.
The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in safety-critical applications.
arXiv Detail & Related papers (2025-11-13T08:42:24Z)
- Quantifying Loss Aversion in Cyber Adversaries via LLM Analysis [2.798191832420146]
IARPA's ReSCIND program seeks to infer, defend against, and exploit attacker cognitive traits.
In this paper, we present a novel methodology that leverages large language models (LLMs) to extract quantifiable insights into the cognitive bias of loss aversion from hacker behavior.
arXiv Detail & Related papers (2025-08-18T05:51:30Z)
- How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies [15.999261636389702]
Learning from Demonstration (LfD) algorithms have shown promising results in robotic manipulation tasks.
However, their vulnerability to offline universal perturbation attacks remains underexplored.
This paper presents a comprehensive study of adversarial attacks on both classic and recently proposed algorithms.
arXiv Detail & Related papers (2025-02-06T01:17:39Z)
- Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.
As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks, arise.
Recent attack methods leverage LLMs' instruction-following abilities and their inability to distinguish instructions injected into the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
- Revealing Vulnerabilities of Neural Networks in Parameter Learning and Defense Against Explanation-Aware Backdoors [2.1165011830664673]
Blinding attacks can drastically alter a machine learning algorithm's prediction and explanation.
We leverage statistical analysis to highlight the changes in a CNN's weights following blinding attacks.
We introduce a method specifically designed to limit the effectiveness of such attacks during the evaluation phase.
arXiv Detail & Related papers (2024-03-25T09:36:10Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies of backdoor attacks will help to understand the model's vulnerability.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z)
- Adversarial attacks and defenses in explainable artificial intelligence: A survey [9.769695768744421]
Recent advances in adversarial machine learning (AdvML) highlight the limitations and vulnerabilities of state-of-the-art explanation methods.
This survey provides a comprehensive overview of research concerning adversarial attacks on explanations of machine learning models.
arXiv Detail & Related papers (2023-06-06T09:53:39Z)
- AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against Interpretable Models [1.8752655643513647]
XAI tools can increase a model's vulnerability to model extraction attacks, which is a concern when model owners prefer black-box access.
We propose a novel retraining (learning) based model extraction attack framework against interpretable models under black-box settings.
We show that AUTOLYCUS is highly effective, requiring significantly fewer queries compared to state-of-the-art attacks.
arXiv Detail & Related papers (2023-02-04T13:23:39Z)
- Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers.
Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods.
Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z)
- The Feasibility and Inevitability of Stealth Attacks [63.14766152741211]
We study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence systems.
In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself.
arXiv Detail & Related papers (2021-06-26T10:50:07Z)
- Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
Adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes.
We develop a unifying framework that not only encompasses and generalizes previous attacks against machine-learning models, but also includes three novel attacks.
These attacks, named Full DOS, Extend, and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section (a byte-level sketch of the Full DOS idea appears after this list).
arXiv Detail & Related papers (2020-08-17T07:16:57Z)
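As referenced in the Adversarial EXEmples entry above, the following is a byte-level sketch of the Full DOS idea. It is an assumption-laden simplification, not the authors' implementation: the DOS header of a PE file is largely ignored by the Windows loader, so the bytes between the 'MZ' magic (offsets 0x00-0x01) and the e_lfanew pointer to the PE header (offsets 0x3C-0x3F) can carry a payload without breaking execution. Here the payload bytes are random and the editable range is deliberately conservative; "sample.exe" is a placeholder path.

```python
from pathlib import Path
import random

# Offsets 0x02-0x3B: inside the 64-byte DOS header, skipping the 'MZ' magic
# and stopping before the 4-byte e_lfanew pointer at 0x3C.
EDITABLE = range(0x02, 0x3C)

def full_dos_perturb(pe_bytes: bytes, payload: bytes) -> bytes:
    """Overwrite loader-ignored DOS-header bytes with an adversarial payload."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("not a PE file (missing MZ magic)")
    out = bytearray(pe_bytes)
    for offset, value in zip(EDITABLE, payload):
        out[offset] = value
    return bytes(out)

# Usage sketch: the payload is random here, but in the actual attack it would
# be chosen by a gradient- or query-based optimizer against the ML detector.
payload = bytes(random.randrange(256) for _ in EDITABLE)
perturbed = full_dos_perturb(Path("sample.exe").read_bytes(), payload)
Path("sample_adv.exe").write_bytes(perturbed)
```

Roughly speaking, the Extend and Shift variants follow the same spirit but must also rewrite header offsets, since they enlarge the region before the PE header or before the first section to make room for the payload.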