eXIAA: eXplainable Injections for Adversarial Attack
- URL: http://arxiv.org/abs/2511.10088v1
- Date: Fri, 14 Nov 2025 01:31:11 GMT
- Title: eXIAA: eXplainable Injections for Adversarial Attack
- Authors: Leonardo Pesce, Jiawen Wei, Gianmarco Mengaldo
- Abstract summary: We show a new black-box, model-agnostic adversarial attack for post-hoc explainable Artificial Intelligence (XAI). The goal of the attack is to modify the original explanations while remaining undetected by the human eye and maintaining the same predicted class. The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in safety-critical applications.
- Score: 3.512208543873998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-hoc explainability methods are a subset of Machine Learning (ML) techniques that aim to explain why a model behaves in a certain way. In this paper, we present a new black-box, model-agnostic adversarial attack on post-hoc explainable Artificial Intelligence (XAI), particularly in the image domain. The goal of the attack is to modify the original explanations while remaining undetected by the human eye and maintaining the same predicted class. In contrast to previous methods, we do not require any access to the model or its weights, only to the model's computed predictions and explanations. Additionally, the attack is accomplished in a single step while significantly changing the provided explanations, as demonstrated by empirical evaluation. The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in safety-critical applications. We systematically generate attacks based on the explanations produced by post-hoc explainability methods (saliency maps, integrated gradients, and DeepLIFT SHAP) for pretrained ResNet-18 and ViT-B16 models on ImageNet. The results show that our attacks can lead to dramatically different explanations without changing the predictive probabilities. We validate the effectiveness of the attack, quantify the induced change in the explanations with the mean absolute difference, and verify the closeness of the original image and the corrupted one with the Structural Similarity Index Measure (SSIM).
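To make the evaluation protocol concrete, here is a minimal hypothetical sketch (not the authors' code), assuming a pretrained torchvision ResNet-18, a vanilla gradient saliency map as the explanation, and scikit-image's SSIM; function names such as `saliency_map` and `evaluate_attack` are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the evaluation protocol described
# above: check that the adversarial image keeps the predicted class, measure
# how far the explanation moves (mean absolute difference between saliency
# maps), and verify visual closeness with SSIM.
import torch
from torchvision.models import resnet18, ResNet18_Weights
from skimage.metrics import structural_similarity as ssim

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()

def saliency_map(model, x):
    """Vanilla gradient saliency: |d logit_c / d x|, max over color channels."""
    x = x.clone().requires_grad_(True)
    logits = model(x)                      # x: (1, 3, H, W)
    logits[0, logits.argmax()].backward()
    return x.grad.abs().max(dim=1)[0].squeeze(0)   # (H, W)

@torch.no_grad()
def predicted_class(model, x):
    return model(x).argmax(dim=1).item()

def evaluate_attack(model, x_orig, x_adv):
    # 1) The attack must not change the predicted class.
    same_class = predicted_class(model, x_orig) == predicted_class(model, x_adv)
    # 2) Explanation change: mean absolute difference between saliency maps.
    mad = (saliency_map(model, x_orig) - saliency_map(model, x_adv)).abs().mean().item()
    # 3) Visual closeness: SSIM between the two (channel-last, [0, 1]) images.
    sim = ssim(x_orig.squeeze(0).permute(1, 2, 0).numpy(),
               x_adv.squeeze(0).permute(1, 2, 0).numpy(),
               channel_axis=2, data_range=1.0)
    return same_class, mad, sim
```

A successful attack in the paper's sense would report `same_class=True`, a large `mad`, and an SSIM close to 1 (the corrupted image looks like the original).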
Related papers
- Explainable but Vulnerable: Adversarial Attacks on XAI Explanation in Cybersecurity Applications [0.21485350418225244]
Explainable Artificial Intelligence (XAI) has aided machine learning (ML) researchers with the power of scrutinizing the decisions of black-box models. However, XAI methods can themselves fall victim to adversarial attacks that manipulate the expected outcome of the explanation module.
arXiv Detail & Related papers (2025-10-04T02:07:58Z) - Fooling SHAP with Output Shuffling Attacks [4.873272103738719]
Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models.
However, adversarial attacks can subvert the detection of XAI methods.
We propose a novel family of attacks, called shuffling attacks, that are data-agnostic.
arXiv Detail & Related papers (2024-08-12T21:57:18Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Revealing Vulnerabilities of Neural Networks in Parameter Learning and Defense Against Explanation-Aware Backdoors [2.1165011830664673]
Blinding attacks can drastically alter a machine learning algorithm's prediction and explanation.
We leverage statistical analysis to highlight the changes in CNN weights following blinding attacks.
We introduce a method specifically designed to limit the effectiveness of such attacks during the evaluation phase.
arXiv Detail & Related papers (2024-03-25T09:36:10Z) - Black-box Attacks on Image Activity Prediction and its Natural Language Explanations [27.301741710016223]
Explainable AI (XAI) methods aim to describe the decision process of deep neural networks.
Visual XAI methods have been shown to be vulnerable to white-box and gray-box adversarial attacks.
We show that we can create adversarial images that manipulate the explanations of an activity recognition model by having access only to its final output.
arXiv Detail & Related papers (2023-09-30T21:56:43Z) - Semantic Image Attack for Visual Model Diagnosis [80.36063332820568]
In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models.
This paper proposes Semantic Image Attack (SIA), a method based on the adversarial attack that provides semantic adversarial images.
arXiv Detail & Related papers (2023-03-23T03:13:04Z) - Meta Adversarial Perturbations [66.43754467275967]
We show the existence of a meta adversarial perturbation (MAP).
A MAP causes natural images to be misclassified with high probability after being refined through only a one-step gradient ascent update (a minimal sketch of such an update follows this list).
We show that these perturbations are not only image-agnostic, but also model-agnostic, as a single perturbation generalizes well across unseen data points and different neural network architectures.
arXiv Detail & Related papers (2021-11-19T16:01:45Z) - Adaptive Feature Alignment for Adversarial Training [56.17654691470554]
CNNs are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications.
We propose adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths.
Our method is trained to automatically align features of arbitrary attacking strength.
arXiv Detail & Related papers (2021-05-31T17:01:05Z) - ExAD: An Ensemble Approach for Explanation-based Adversarial Detection [17.455233006559734]
We propose ExAD, a framework to detect adversarial examples using an ensemble of explanation techniques.
We evaluate our approach using six state-of-the-art adversarial attacks on three image datasets.
arXiv Detail & Related papers (2021-03-22T00:53:07Z) - Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
arXiv Detail & Related papers (2020-09-03T19:02:55Z) - Anomaly Detection-Based Unknown Face Presentation Attack Detection [74.4918294453537]
Anomaly detection-based spoof attack detection is a recent development in face Presentation Attack Detection.
In this paper, we present a deep-learning solution for anomaly detection-based spoof attack detection.
The proposed approach benefits from the representation-learning power of CNNs and learns better features for the face presentation attack detection (fPAD) task.
arXiv Detail & Related papers (2020-07-11T21:20:55Z)
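Regarding the "Meta Adversarial Perturbations" entry above, a one-step gradient-ascent refinement of a shared, image-agnostic perturbation can be sketched as follows. This is a hypothetical FGSM-style illustration under an assumed step size `alpha` and L-infinity budget `eps`, not the authors' algorithm.

```python
# Hypothetical one-step gradient-ascent update of a shared (image-agnostic)
# perturbation, in the spirit of the MAP entry above; not the authors' code.
import torch
import torch.nn.functional as F

def one_step_update(model, x, y, delta, alpha=2/255, eps=8/255):
    """Refine perturbation `delta` with a single FGSM-style gradient-ascent step."""
    delta = delta.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)   # loss to *increase* (ascent)
    loss.backward()
    # Step in the sign of the gradient and keep delta within the L-inf ball.
    return (delta + alpha * delta.grad.sign()).clamp(-eps, eps).detach()
```

The refined `delta` would then be added to unseen images, matching the summary's claim that a single perturbation generalizes across data points and architectures.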
This list is automatically generated from the titles and abstracts of the papers on this site.