Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial
Perturbations against Interpretable Deep Learning
- URL: http://arxiv.org/abs/2211.15926v1
- Date: Tue, 29 Nov 2022 04:45:10 GMT
- Title: Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial
Perturbations against Interpretable Deep Learning
- Authors: Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Eric Chan-Tin,
Tamer Abuhmed
- Abstract summary: This work introduces two attacks, AdvEdge and AdvEdge$^{+}$, that deceive both the target deep learning model and the coupled interpretation model.
Our analysis shows the effectiveness of our attacks in terms of deceiving the deep learning models and their interpreters.
- Score: 16.13790238416691
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning methods have gained increased attention in various applications
due to their outstanding performance. To explore how this high performance
relates to the proper use of data artifacts and the accurate formulation of a
given task, interpretation models have become a crucial
component in developing deep learning-based systems. Interpretation models
enable the understanding of the inner workings of deep learning models and
offer a sense of security in detecting the misuse of artifacts in the input
data. Similar to prediction models, interpretation models are also susceptible
to adversarial inputs. This work introduces two attacks, AdvEdge and
AdvEdge$^{+}$, that deceive both the target deep learning model and the coupled
interpretation model. We assess the effectiveness of proposed attacks against
two deep learning model architectures coupled with four interpretation models
that represent different categories of interpretation models. Our experiments
include the attack implementation using various attack frameworks. We also
explore the potential countermeasures against such attacks. Our analysis shows
the effectiveness of our attacks in terms of deceiving the deep learning models
and their interpreters, and highlights insights to improve and circumvent the
attacks.
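As a rough illustration of the general attack idea described above (not the authors' AdvEdge implementation), the sketch below assumes a PGD-style perturbation that drives the classifier toward a target label while keeping the attribution map of a coupled, differentiable interpreter close to a benign-looking map; the model, interpreter, loss weight, and step sizes are placeholders.

```python
import torch
import torch.nn.functional as F

def joint_attack(model, interpreter, x, y_target, map_benign,
                 eps=8 / 255, alpha=1 / 255, steps=50, lam=0.1):
    """PGD-style attack that fools the classifier while keeping the
    interpretation map close to a benign-looking target map.
    `model`, `interpreter`, and all hyperparameters are illustrative
    placeholders, not the AdvEdge implementation from the paper."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        # Push the prediction toward the adversarial target label.
        cls_loss = F.cross_entropy(logits, y_target)
        # Keep the attribution map close to the benign map so the
        # coupled interpreter does not reveal the manipulation.
        int_loss = F.mse_loss(interpreter(x_adv), map_benign)
        loss = cls_loss + lam * int_loss
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()               # descend the joint loss
            x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)  # stay in the eps-ball
            x_adv = x_adv.clamp(0, 1)                         # stay a valid image
    return x_adv.detach()
```

In this sketch, the single weight lam trades off misclassification strength against interpretation stealth.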
Related papers
- Analyzing the Impact of Adversarial Examples on Explainable Machine
Learning [0.31498833540989407]
Adversarial attacks on machine learning models deliberately modify inputs to cause the model to make incorrect predictions.
Work on the vulnerability of deep learning models has shown that it is easy to craft samples that force a model into predictions its designers do not intend.
In this work, we analyze the impact of adversarial attacks on model interpretability in text classification problems.
arXiv Detail & Related papers (2023-07-17T08:50:36Z)
- Deviations in Representations Induced by Adversarial Attacks [0.0]
Research has shown that deep learning models are vulnerable to adversarial attacks.
This finding brought about a new direction in research, whereby algorithms were developed to attack and defend vulnerable networks.
We present a method for measuring and analyzing the deviations in representations induced by adversarial attacks.
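As a hedged illustration of how such deviations might be measured (not the paper's specific method), the sketch below compares a chosen layer's activations on clean and adversarial inputs via a forward hook and a cosine distance; the layer choice and the metric are assumptions.

```python
import torch
import torch.nn.functional as F

def representation_deviation(model, layer, x_clean, x_adv):
    """Compare a layer's activations on clean vs. adversarial inputs.
    `layer` is any nn.Module used inside `model`'s forward pass; the
    cosine-distance metric is an illustrative choice, not necessarily
    the paper's measure."""
    feats = {}

    def hook(_module, _inp, out):
        feats["z"] = out.detach().flatten(1)  # (batch, features)

    handle = layer.register_forward_hook(hook)
    try:
        model.eval()
        with torch.no_grad():
            model(x_clean)
            z_clean = feats["z"]
            model(x_adv)
            z_adv = feats["z"]
    finally:
        handle.remove()

    # Per-sample cosine distance between clean and adversarial features.
    return 1.0 - F.cosine_similarity(z_clean, z_adv, dim=1)
```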
arXiv Detail & Related papers (2022-11-07T17:40:08Z)
- Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms by applying adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
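For readers unfamiliar with adversarial training, the sketch below shows a generic FGSM-based training loop; it is a minimal placeholder, not the jet-tagging setup or attack model used in the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=0.01):
    """One epoch of FGSM-style adversarial training: perturb each batch
    along the sign of the input gradient, then train on the perturbed
    inputs. Placeholder setup, not the paper's pipeline."""
    model.train()
    for x, y in loader:
        # Craft the adversarial batch.
        x_pert = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_pert), y)
        grad = torch.autograd.grad(loss, x_pert)[0]
        x_adv = (x_pert + eps * grad.sign()).detach()

        # Standard training step on the adversarial inputs.
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```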
arXiv Detail & Related papers (2022-03-25T19:57:19Z)
- Delving into Data: Effectively Substitute Training for Black-box Attack [84.85798059317963]
We propose substitute training from a novel perspective that focuses on designing the distribution of data used in the knowledge-stealing process.
Combining these two modules further boosts the consistency between the substitute model and the target model, which greatly improves the effectiveness of the adversarial attack.
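As a simplified sketch of the substitute-training idea behind such black-box attacks (omitting the paper's data-distribution design, which is its actual contribution), the code below queries a black-box target for labels, fits a local substitute, and crafts transferable FGSM perturbations on it; all names and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def train_substitute(target_query, substitute, data_loader, optimizer, epochs=5):
    """Fit a local substitute model on labels obtained by querying the
    black-box target. `target_query(x)` is assumed to return predicted
    class indices; everything here is an illustrative placeholder."""
    substitute.train()
    for _ in range(epochs):
        for x, _ in data_loader:
            with torch.no_grad():
                y_target = target_query(x)  # black-box label queries
            optimizer.zero_grad()
            F.cross_entropy(substitute(x), y_target).backward()
            optimizer.step()

def transfer_attack(substitute, x, y, eps=8 / 255):
    """Craft an FGSM perturbation on the substitute and rely on
    transferability to fool the unseen target model."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(substitute(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()
```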
arXiv Detail & Related papers (2021-04-26T07:26:29Z)
- Evaluating Deception Detection Model Robustness To Linguistic Variation [10.131671217810581]
We propose an analysis of model robustness against linguistic variation in the setting of deceptive news detection.
We consider two prediction tasks and compare three state-of-the-art embeddings to highlight consistent trends in model performance.
We find that character or mixed ensemble models are the most effective defenses and that character perturbation-based attack tactics are more successful.
arXiv Detail & Related papers (2021-04-23T17:25:38Z)
- Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing these elements can quickly pinpoint exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z)
- ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models [64.03398193325572]
Inference attacks against Machine Learning (ML) models allow adversaries to learn about training data, model parameters, etc.
We concentrate on four attacks - namely, membership inference, model inversion, attribute inference, and model stealing.
Our analysis relies on ML-Doctor, a modular, re-usable software tool that enables ML model owners to assess the risks of deploying their models.
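ML-Doctor is a full assessment framework; as a rough, hedged illustration of only one of the four attack classes, the sketch below implements a simple loss-threshold membership-inference baseline with placeholder inputs.

```python
import torch
import torch.nn.functional as F

def loss_threshold_membership_inference(model, x, y, threshold):
    """Simple membership-inference baseline: samples on which the model's
    loss falls below a threshold are guessed to be training members.
    The threshold and inputs are placeholders; ML-Doctor implements far
    more elaborate attacks than this illustration."""
    model.eval()
    with torch.no_grad():
        losses = F.cross_entropy(model(x), y, reduction="none")
    return losses < threshold  # True = predicted "member"
```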
arXiv Detail & Related papers (2021-02-04T11:35:13Z)
- Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations [81.82518920087175]
Adversarial attacks aim to fool deep neural networks with adversarial examples.
We propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently.
arXiv Detail & Related papers (2020-09-19T09:12:24Z)
- Evaluating Neural Machine Comprehension Model Robustness to Noisy Inputs and Adversarial Attacks [9.36331571226256]
We evaluate machine comprehension models' robustness to noise and adversarial attacks by performing novel perturbations at the character, word, and sentence level.
We develop a model to predict model errors during adversarial attacks.
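As a toy example of a character-level perturbation (one of the three levels mentioned above, not the paper's full perturbation suite), the sketch below randomly swaps adjacent characters inside words.

```python
import random

def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly swap adjacent characters inside words to simulate noisy
    input. A toy character-level perturbation, not the paper's full
    character/word/sentence-level attack suite."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        chars = list(word)
        if len(chars) > 3 and rng.random() < rate:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        words.append("".join(chars))
    return " ".join(words)

# Example: swap_adjacent_chars("the senator denied the allegations", rate=0.5)
```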
arXiv Detail & Related papers (2020-05-01T03:05:43Z)
- Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples [84.8370546614042]
The black-box nature of Deep Learning models has posed unanswered questions about what they learn from data.
A Generative Adversarial Network (GAN) and multi-objective optimization are used to furnish a plausible attack against the audited model.
Its utility is showcased within a human face classification task, unveiling the enormous potential of the proposed framework.
arXiv Detail & Related papers (2020-03-25T11:08:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.