Unsupervised Detection of Adversarial Examples with Model Explanations
- URL: http://arxiv.org/abs/2107.10480v1
- Date: Thu, 22 Jul 2021 06:54:18 GMT
- Title: Unsupervised Detection of Adversarial Examples with Model Explanations
- Authors: Gihyuk Ko, Gyumin Lim
- Abstract summary: We propose a simple yet effective method to detect adversarial examples using methods developed to explain the model's behavior.
Our evaluations on the MNIST handwritten digit dataset show that our method detects adversarial examples with high confidence.
- Score: 0.6091702876917279
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Neural Networks (DNNs) have shown remarkable performance in a diverse
range of machine learning applications. However, it is widely known that DNNs
are vulnerable to simple adversarial perturbations, which cause the model to
incorrectly classify inputs. In this paper, we propose a simple yet effective
method to detect adversarial examples, using methods developed to explain the
model's behavior. Our key observation is that adding small, human-imperceptible
perturbations can lead to drastic changes in the model
explanations, resulting in unusual or irregular forms of explanations. From
this insight, we propose an unsupervised detection method for adversarial
examples, using reconstructor networks trained only on model explanations of benign
examples. Our evaluations on the MNIST handwritten digit dataset show that our
method detects adversarial examples generated by state-of-the-art attack
algorithms with high confidence. To the best of our knowledge, this is the
first work to propose an unsupervised defense method based on model explanations.
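As a minimal sketch of the pipeline described in the abstract, the code below assumes a gradient-saliency map as the model explanation and a small dense autoencoder as the reconstructor network; the paper's exact explanation method, reconstructor architecture, and thresholding rule are not specified here and may differ.

```python
# Minimal sketch of explanation-based unsupervised detection.
# Assumptions (not taken from the paper): gradient saliency as the
# explanation and a small dense autoencoder as the reconstructor.
import torch
import torch.nn as nn

def explanation(model, x):
    """Saliency map: |d(top-class score)/d(input)|, flattened per example."""
    x = x.clone().requires_grad_(True)
    model(x).max(dim=1).values.sum().backward()
    sal = x.grad.detach().abs().flatten(1)                # (batch, 784) on MNIST
    return sal / (sal.amax(dim=1, keepdim=True) + 1e-8)   # scale to [0, 1]

class Reconstructor(nn.Module):
    """Autoencoder trained only on explanations of benign examples."""
    def __init__(self, dim=28 * 28, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden, dim), nn.Sigmoid())

    def forward(self, e):
        return self.decoder(self.encoder(e))

def is_adversarial(model, reconstructor, x, threshold):
    """Flag inputs whose explanations reconstruct poorly."""
    e = explanation(model, x)
    err = ((reconstructor(e) - e) ** 2).mean(dim=1)       # per-example MSE
    return err > threshold
```

In this reading, the reconstructor is fit with an ordinary reconstruction loss on explanations of benign training data only, and the detection threshold is chosen from the benign reconstruction-error distribution (for example, a high percentile).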
Related papers
- Manipulating Feature Visualizations with Gradient Slingshots [54.31109240020007]
We introduce a novel method for manipulating Feature Visualization (FV) without significantly impacting the model's decision-making process.
We evaluate the effectiveness of our method on several neural network models and demonstrate its ability to hide the functionality of arbitrarily chosen neurons.
arXiv Detail & Related papers (2024-01-11T18:57:17Z)
- Adversarial Examples Detection with Enhanced Image Difference Features based on Local Histogram Equalization [20.132066800052712]
We propose an adversarial example detection framework based on a high-frequency information enhancement strategy.
This framework can effectively extract and amplify the feature differences between adversarial examples and normal examples.
arXiv Detail & Related papers (2023-05-08T03:14:01Z)
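One plausible, simplified reading of the high-frequency enhancement idea above, using OpenCV's CLAHE; the function name, parameters, and downstream detector are illustrative rather than the paper's actual pipeline.

```python
# Hypothetical illustration of a local-histogram-equalization difference
# feature (not the paper's exact pipeline): CLAHE amplifies local contrast,
# and the difference with the original image isolates high-frequency detail
# where adversarial perturbations tend to show up.
import cv2
import numpy as np

def difference_feature(gray_uint8: np.ndarray,
                       clip_limit: float = 2.0,
                       tile: tuple = (8, 8)) -> np.ndarray:
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile)
    enhanced = clahe.apply(gray_uint8)           # locally equalized image
    return cv2.absdiff(enhanced, gray_uint8)     # enhanced "difference feature"
```

A separate detector (for example, a small CNN) would then be trained on such difference features to separate adversarial from benign inputs.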
- On the Robustness of Explanations of Deep Neural Network Models: A Survey [14.940679892694089]
We present a comprehensive survey of methods that study, understand, attack, and defend explanations of Deep Neural Network (DNN) models.
We also present a detailed review of the metrics used to evaluate explanation methods and describe attributional attack and defense methods.
arXiv Detail & Related papers (2022-11-09T10:14:21Z)
- ExAD: An Ensemble Approach for Explanation-based Adversarial Detection [17.455233006559734]
We propose ExAD, a framework to detect adversarial examples using an ensemble of explanation techniques.
We evaluate our approach using six state-of-the-art adversarial attacks on three image datasets.
arXiv Detail & Related papers (2021-03-22T00:53:07Z)
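A schematic of the ensemble idea, assuming one anomaly detector per explanation technique, each trained only on explanations of normal examples, combined by a simple voting rule; the concrete explanation techniques, detector models, and combination rule are placeholders, not ExAD's exact design.

```python
# Schematic ensemble of explanation-based detectors (placeholder logic):
# one anomaly detector per explanation technique, each trained only on
# explanations of normal examples.
from typing import Callable, List
import numpy as np

ExplainFn = Callable[[np.ndarray], np.ndarray]   # input -> explanation map
ScoreFn = Callable[[np.ndarray], float]          # explanation -> anomaly score

def ensemble_detect(x: np.ndarray,
                    explainers: List[ExplainFn],
                    detectors: List[ScoreFn],
                    thresholds: List[float],
                    min_votes: int = 1) -> bool:
    """Flag x as adversarial if at least `min_votes` detectors fire."""
    votes = 0
    for explain, score, thr in zip(explainers, detectors, thresholds):
        if score(explain(x)) > thr:              # this detector flags the explanation
            votes += 1
    return votes >= min_votes
```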
- Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing these activation profiles can quickly pinpoint exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
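To make "diversity-enforcing loss" concrete, the snippet below shows one generic way to penalize similarity among several latent perturbations; it is not necessarily the formulation used in the paper above.

```python
# Generic diversity-enforcing penalty (illustrative, not the paper's exact
# loss): discourage a set of latent perturbations from collapsing onto each
# other by penalizing their pairwise cosine similarity.
import torch
import torch.nn.functional as F

def diversity_loss(perturbations: torch.Tensor) -> torch.Tensor:
    """perturbations: (k, d) tensor holding k latent perturbations."""
    z = F.normalize(perturbations, dim=1)         # unit-norm rows
    sim = z @ z.t()                               # (k, k) cosine similarities
    off_diag = sim - torch.eye(z.size(0), device=z.device)
    return off_diag.abs().mean()                  # small when perturbations differ
```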
- Adversarial Examples for Unsupervised Machine Learning Models [71.81480647638529]
Adversarial examples causing evasive predictions are widely used to evaluate and improve the robustness of machine learning models.
We propose a framework of generating adversarial examples for unsupervised models and demonstrate novel applications to data augmentation.
arXiv Detail & Related papers (2021-03-02T17:47:58Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
From these adversarial examples, we derive word replacement rules that can be used for model diagnostics.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
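The genetic-algorithm idea above might look roughly like the skeleton below, which evolves binary masks over a pool of surrogate models; the fitness function (for example, how well adversarial examples crafted against the selected subset transfer to held-out victim models) is left as a caller-supplied callable, and all details here are illustrative.

```python
# Skeleton genetic algorithm for selecting an ensemble of surrogate models
# (illustrative only; the fitness callable is supplied by the caller, e.g.
# the transfer rate of adversarial examples crafted against the subset).
import random
from typing import Callable, List

def evolve_ensemble(pool_size: int,
                    fitness: Callable[[List[int]], float],
                    population: int = 20,
                    generations: int = 30,
                    mutation_rate: float = 0.1) -> List[int]:
    pop = [[random.randint(0, 1) for _ in range(pool_size)]
           for _ in range(population)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: population // 2]            # keep the fitter half
        children = []
        while len(parents) + len(children) < population:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, pool_size)       # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                   # bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)                       # best mask over the pool
```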
- Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions.
We show that kNN representations are effective at uncovering learned spurious associations.
Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
arXiv Detail & Related papers (2020-10-18T16:55:25Z)
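As a small illustration of the kNN-representation idea above, the sketch below retrieves the training examples whose hidden-layer features are nearest to a test input's features; the choice of layer, distance metric, and k are assumptions.

```python
# Illustrative kNN over hidden representations (assumed details: a single
# feature layer, Euclidean distance, scikit-learn's NearestNeighbors).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearest_training_examples(train_feats: np.ndarray,
                              test_feat: np.ndarray,
                              k: int = 5) -> np.ndarray:
    """Return indices of the k training examples closest in feature space;
    these are candidate examples 'responsible' for the model's prediction."""
    index = NearestNeighbors(n_neighbors=k).fit(train_feats)   # (n, d) features
    _, idx = index.kneighbors(test_feat.reshape(1, -1))        # (1, k) indices
    return idx[0]
```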
This list is automatically generated from the titles and abstracts of the papers on this site.