A simple defense against adversarial attacks on heatmap explanations
- URL: http://arxiv.org/abs/2007.06381v1
- Date: Mon, 13 Jul 2020 13:44:13 GMT
- Title: A simple defense against adversarial attacks on heatmap explanations
- Authors: Laura Rieger, Lars Kai Hansen
- Abstract summary: A potential concern is the so-called "fair-washing" -
manipulating a model such that the features used in reality are hidden and more innocuous features are shown to be important instead.
We present an effective defence against such adversarial attacks on neural networks.
- Score: 6.312527106205531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With machine learning models being used for more sensitive applications, we
rely on interpretability methods to prove that no discriminating attributes
were used for classification. A potential concern is the so-called
"fair-washing" - manipulating a model such that the features used in reality
are hidden and more innocuous features are shown to be important instead.
In our work we present an effective defence against such adversarial attacks
on neural networks. By a simple aggregation of multiple explanation methods,
the network becomes robust against manipulation. This holds even when the
attacker has exact knowledge of the model weights and the explanation methods
used.
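A minimal sketch of the aggregation idea in PyTorch (not the authors' implementation): heatmaps from two standard gradient-based explanation methods are put on a common scale and averaged, so that a manipulation tuned to any single method has limited influence on the aggregate. The choice of methods (vanilla gradients and gradient-times-input), the per-map normalisation, and the batch-of-one assumption are illustrative.

```python
import torch

def saliency(model, x, target):
    # Vanilla-gradient heatmap: |d logit_target / d x|, summed over colour channels.
    x = x.detach().clone().requires_grad_(True)
    model(x)[0, target].backward()          # assumes a single input of shape (1, C, H, W)
    return x.grad.abs().sum(dim=1)

def grad_times_input(model, x, target):
    # Gradient * input heatmap.
    x = x.detach().clone().requires_grad_(True)
    model(x)[0, target].backward()
    return (x.grad * x.detach()).abs().sum(dim=1)

def aggregated_heatmap(model, x, target):
    # Normalise each heatmap to unit mass, then average across methods.
    maps = [saliency(model, x, target), grad_times_input(model, x, target)]
    maps = [m / (m.sum() + 1e-12) for m in maps]
    return torch.stack(maps).mean(dim=0)
```

An attacker who tunes the model or the input against one explanation method must fool every method in the ensemble at once, which is the intuition the abstract points to.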
Related papers
- Investigating Human-Identifiable Features Hidden in Adversarial Perturbations [54.39726653562144]
Our study explores up to five attack algorithms across three datasets.
We identify human-identifiable features in adversarial perturbations.
Using pixel-level annotations, we extract such features and demonstrate their ability to compromise target models.
arXiv Detail & Related papers (2023-09-28T22:31:29Z)
- MOVE: Effective and Harmless Ownership Verification via Embedded External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.
We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z)
- Backdooring Explainable Machine Learning [0.8180960351554997]
We demonstrate blinding attacks that can fully disguise an ongoing attack against the machine learning model.
Similar to neural backdoors, we modify the model's prediction upon trigger presence but simultaneously also fool the provided explanation.
arXiv Detail & Related papers (2022-04-20T14:40:09Z)
- Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks.
We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering.
Our evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z)
- Unsupervised Detection of Adversarial Examples with Model Explanations [0.6091702876917279]
We propose a simple yet effective method to detect adversarial examples using methods developed to explain the model's behavior.
Our evaluations on the MNIST handwritten digit dataset show that our method is capable of detecting adversarial examples with high confidence.
arXiv Detail & Related papers (2021-07-22T06:54:18Z)
- Adversarial Examples Make Strong Poisons [55.63469396785909]
We show that adversarial examples, originally intended for attacking pre-trained models, are even more effective for data poisoning than recent methods designed specifically for poisoning.
Our method, adversarial poisoning, is substantially more effective than existing poisoning methods for secure dataset release.
arXiv Detail & Related papers (2021-06-21T01:57:14Z)
- Combating Adversaries with Anti-Adversaries [118.70141983415445]
In particular, our layer generates an input perturbation in the opposite direction of the adversarial one.
We verify the effectiveness of our approach by combining our layer with both nominally and robustly trained models.
Our anti-adversary layer significantly enhances model robustness while coming at no cost on clean accuracy (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-03-26T09:36:59Z)
- ExAD: An Ensemble Approach for Explanation-based Adversarial Detection [17.455233006559734]
We propose ExAD, a framework to detect adversarial examples using an ensemble of explanation techniques.
We evaluate our approach using six state-of-the-art adversarial attacks on three image datasets.
arXiv Detail & Related papers (2021-03-22T00:53:07Z)
- Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing these elements can quickly pinpoint exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z)
- Adversarial Feature Desensitization [12.401175943131268]
We propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field.
Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs.
arXiv Detail & Related papers (2020-06-08T14:20:02Z)
- Class-Aware Domain Adaptation for Improving Adversarial Robustness [27.24720754239852]
Adversarial training has been proposed to train networks by injecting adversarial examples into the training data.
We propose a novel Class-Aware Domain Adaptation (CADA) method for adversarial defense without directly applying adversarial training.
arXiv Detail & Related papers (2020-05-10T03:45:19Z)
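As noted in the "Combating Adversaries with Anti-Adversaries" entry above, that paper's layer generates an input perturbation in the opposite direction of the adversarial one. A minimal sketch of that idea (not the paper's implementation; the pseudo-label taken from the model's own prediction, the signed-gradient update, and the step size and count are assumptions):

```python
import torch
import torch.nn.functional as F

def anti_adversary(model, x, step_size=0.01, steps=2):
    # Nudge x towards the class the model already predicts, i.e. descend the
    # loss an attacker would ascend. The model is assumed to be in eval mode.
    with torch.no_grad():
        pseudo_label = model(x).argmax(dim=1)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), pseudo_label)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta - step_size * grad.sign()).detach().requires_grad_(True)
    return (x + delta).detach()

# The classifier would then be evaluated on anti_adversary(model, x) instead of x.
```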
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.