Backdooring Explainable Machine Learning
- URL: http://arxiv.org/abs/2204.09498v1
- Date: Wed, 20 Apr 2022 14:40:09 GMT
- Title: Backdooring Explainable Machine Learning
- Authors: Maximilian Noppel and Lukas Peter and Christian Wressnegger
- Abstract summary: We demonstrate blinding attacks that can fully disguise an ongoing attack against the machine learning model.
Similar to neural backdoors, we modify the model's prediction upon trigger presence but simultaneously also fool the provided explanation.
- Score: 0.8180960351554997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explainable machine learning holds great potential for analyzing and
understanding learning-based systems. These methods can, however, be
manipulated to present unfaithful explanations, giving rise to powerful and
stealthy adversaries. In this paper, we demonstrate blinding attacks that can
fully disguise an ongoing attack against the machine learning model. Similar to
neural backdoors, we modify the model's prediction upon trigger presence but
simultaneously also fool the provided explanation. This enables an adversary to
hide the presence of the trigger or point the explanation to entirely different
portions of the input, throwing a red herring. We analyze different
manifestations of such attacks for different explanation types in the image
domain, before we proceed to conduct a red-herring attack against malware
classification.
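A minimal sketch of the idea behind such a blinding attack is given below. It combines a standard backdoor objective with an explanation-manipulation term; the PyTorch classifier interface, the fixed corner patch used as trigger, the input-gradient saliency used as the attacked explanation, and the loss weighting are all illustrative assumptions and not taken from the paper.

```python
# Minimal sketch of a "blinding" fine-tuning objective, assuming a PyTorch image
# classifier, a fixed corner patch as the trigger, and plain input-gradient
# saliency as the explanation method. These choices are illustrative only.
import torch
import torch.nn.functional as F

def apply_trigger(x, value=1.0, size=4):
    """Stamp a small square trigger into the bottom-right corner of each image."""
    x = x.clone()
    x[..., -size:, -size:] = value
    return x

def saliency(model, x):
    """Input-gradient heatmap, kept differentiable so it can be attacked."""
    x = x.detach().clone().requires_grad_(True)
    score = model(x).max(dim=1).values.sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs().mean(dim=1)  # one [H, W] map per image

def blinding_loss(model, x, y, target_class, lam=1.0):
    """Backdoor loss plus a term that hides the trigger from the explanation."""
    x_trig = apply_trigger(x)
    l_clean = F.cross_entropy(model(x), y)                    # keep clean accuracy
    l_attack = F.cross_entropy(model(x_trig),                 # flip triggered inputs
                               torch.full_like(y, target_class))
    l_expl = F.mse_loss(saliency(model, x_trig),              # triggered explanation
                        saliency(model, x).detach())          # should look clean
    return l_clean + l_attack + lam * l_expl
```

In a training loop, minimizing this loss over batches would keep clean inputs behaving normally while triggered inputs are misclassified yet receive innocuous-looking saliency maps.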
Related papers
- Psychometrics for Hypnopaedia-Aware Machinery via Chaotic Projection of Artificial Mental Imagery [21.450023199935206]
A backdoor attack involves the clandestine infiltration of a trigger during the learning process.
We propose a cybernetic framework for constant surveillance of backdoor threats.
We develop a self-aware unlearning mechanism to autonomously detach a machine's behaviour from the backdoor trigger.
arXiv Detail & Related papers (2024-09-29T00:59:26Z) - Investigating Human-Identifiable Features Hidden in Adversarial Perturbations [54.39726653562144]
Our study explores up to five attack algorithms across three datasets.
We identify human-identifiable features in adversarial perturbations.
Using pixel-level annotations, we extract such features and demonstrate their ability to compromise target models.
arXiv Detail & Related papers (2023-09-28T22:31:29Z) - XRand: Differentially Private Defense against Explanation-Guided Attacks [19.682368614810756]
We introduce a new concept of achieving local differential privacy (LDP) in the explanations.
We show that our mechanism restricts the information that the adversary can learn about the top important features, while maintaining the faithfulness of the explanations.
arXiv Detail & Related papers (2022-12-08T18:23:59Z) - Detect & Reject for Transferability of Black-box Adversarial Attacks Against Network Intrusion Detection Systems [0.0]
We investigate the transferability of adversarial network traffic against machine learning-based intrusion detection systems.
We examine Detect & Reject as a defensive mechanism to limit the effect of the transferability property of adversarial network traffic against machine learning-based intrusion detection systems.
arXiv Detail & Related papers (2021-12-22T17:54:54Z) - Attack to Fool and Explain Deep Networks [59.97135687719244]
We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations.
Our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models.
arXiv Detail & Related papers (2021-06-20T03:07:36Z) - Backdoor Attack in the Physical World [49.64799477792172]
Backdoor attacks intend to inject a hidden backdoor into deep neural networks (DNNs).
Most existing backdoor attacks adopt a static-trigger setting, i.e., the triggers in the training and testing images share the same appearance and location (a minimal sketch of this setting appears after this list).
We demonstrate that this attack paradigm is vulnerable when the trigger in the testing images is not consistent with the one used for training.
arXiv Detail & Related papers (2021-04-06T08:37:33Z) - This is not the Texture you are looking for! Introducing Novel Counterfactual Explanations for Non-Experts using Generative Adversarial Learning [59.17685450892182]
Counterfactual explanation systems try to enable counterfactual reasoning by modifying the input image.
We present a novel approach to generate such counterfactual image explanations based on adversarial image-to-image translation techniques.
Our approach leads to significantly better results regarding mental models, explanation satisfaction, trust, emotions, and self-efficacy than two state-of-the-art systems.
arXiv Detail & Related papers (2020-12-22T10:08:05Z) - A simple defense against adversarial attacks on heatmap explanations [6.312527106205531]
A potential concern is so-called "fair-washing": manipulating a model such that the features used in reality are hidden and more innocuous features are shown to be important instead.
We present an effective defence against such adversarial attacks on neural networks.
arXiv Detail & Related papers (2020-07-13T13:44:13Z) - Adversarial Attacks and Defenses: An Interpretation Perspective [80.23908920686625]
We review recent work on adversarial attacks and defenses, particularly from the perspective of machine learning interpretation.
The goal of model interpretation, or interpretable machine learning, is to extract human-understandable explanations of the working mechanism of models.
For each type of interpretation, we elaborate on how it could be used for adversarial attacks and defenses.
arXiv Detail & Related papers (2020-04-23T23:19:00Z) - Rethinking the Trigger of Backdoor Attack [83.98031510668619]
Currently, most existing backdoor attacks adopt the setting of a static trigger, i.e., the triggers in the training and testing images follow the same appearance and are located in the same area.
We demonstrate that such an attack paradigm is vulnerable when the trigger in testing images is not consistent with the one used for training.
arXiv Detail & Related papers (2020-04-09T17:19:37Z)
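The static-trigger fragility noted in the two backdoor entries above can be probed with a small evaluation helper. The model interface, trigger geometry, and offsets below are assumptions made for the sake of the example, not details from either paper.

```python
# Minimal sketch (illustrative assumptions only): evaluate how a backdoor trained
# with a fixed-position trigger behaves when the test-time trigger is shifted.
import torch

def stamp_trigger(x, value=1.0, size=4, offset=(0, 0)):
    """Place a square trigger near the bottom-right corner, shifted by `offset`."""
    x = x.clone()
    dy, dx = offset
    h, w = x.shape[-2:]
    y0, x0 = h - size - dy, w - size - dx
    x[..., y0:y0 + size, x0:x0 + size] = value
    return x

@torch.no_grad()
def attack_success_rate(model, images, target_class, offset=(0, 0)):
    """Fraction of triggered inputs classified as the attacker's target class."""
    preds = model(stamp_trigger(images, offset=offset)).argmax(dim=1)
    return (preds == target_class).float().mean().item()
```

Comparing attack_success_rate(model, x, t) at offset=(0, 0) with a shifted offset such as (8, 8) illustrates how a location mismatch between training and testing weakens a statically placed trigger.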
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.