Adversarial Attacks and Defenses: An Interpretation Perspective
- URL: http://arxiv.org/abs/2004.11488v2
- Date: Wed, 7 Oct 2020 15:43:26 GMT
- Title: Adversarial Attacks and Defenses: An Interpretation Perspective
- Authors: Ninghao Liu, Mengnan Du, Ruocheng Guo, Huan Liu, Xia Hu
- Abstract summary: We review recent work on adversarial attacks and defenses, particularly from the perspective of machine learning interpretation.
The goal of model interpretation, or interpretable machine learning, is to extract human-understandable terms for the working mechanism of models.
For each type of interpretation, we elaborate on how it could be used for adversarial attacks and defenses.
- Score: 80.23908920686625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent advances in a wide spectrum of applications, machine
learning models, especially deep neural networks, have been shown to be
vulnerable to adversarial attacks. Attackers add carefully-crafted
perturbations to input, where the perturbations are almost imperceptible to
humans, but can cause models to make wrong predictions. Techniques to protect
models against adversarial input are called adversarial defense methods.
Although many approaches have been proposed to study adversarial attacks and
defenses in different scenarios, an intriguing and crucial challenge remains:
how can we really understand model vulnerability? Inspired by the saying that
"if you know yourself and your enemy, you need not fear the battles," we may
tackle this challenge by interpreting machine learning models to open the
black boxes. The goal of model interpretation, or interpretable
machine learning, is to extract human-understandable terms for the working
mechanism of models. Recently, some approaches have started incorporating
interpretation into the exploration of adversarial attacks and defenses.
Meanwhile, we also observe that many existing methods of adversarial attacks
and defenses, although not explicitly claimed, can be understood from the
perspective of interpretation. In this paper, we review recent work on
adversarial attacks and defenses, particularly from the perspective of machine
learning interpretation. We categorize interpretation into two types,
feature-level interpretation and model-level interpretation. For each type of
interpretation, we elaborate on how it could be used for adversarial attacks
and defenses. We then briefly illustrate additional correlations between
interpretation and adversaries. Finally, we discuss the challenges and future
directions in tackling adversarial issues with interpretation.
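As a minimal illustration of the connection the abstract draws between attacks and feature-level interpretation, the sketch below (not code from the survey) pairs an FGSM-style perturbation with a gradient-saliency map; the toy linear model, random inputs, and epsilon value are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder classifier standing in for any differentiable model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

loss_fn = nn.CrossEntropyLoss()

def input_gradient(x, y):
    """Gradient of the classification loss with respect to the input pixels."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return x.grad.detach()

def fgsm_attack(x, y, eps=0.1):
    """FGSM-style attack: one step along the sign of the input gradient."""
    grad = input_gradient(x, y)
    return (x + eps * grad.sign()).clamp(0.0, 1.0)  # keep pixels in [0, 1]

def saliency_map(x, y):
    """Feature-level interpretation: per-pixel importance from |input gradient|."""
    return input_gradient(x, y).abs()

# Toy usage with random "images" standing in for a real dataset.
x = torch.rand(4, 1, 28, 28)
y = torch.randint(0, 10, (4,))

x_adv = fgsm_attack(x, y)
sal = saliency_map(x, y)
print("max perturbation:", (x_adv - x).abs().max().item())
print("saliency map shape:", tuple(sal.shape))
```

Both routines are built from the same input gradient, which is one concrete sense in which feature-level interpretation and adversarial perturbations probe the same signal.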
Related papers
- On the Difficulty of Defending Contrastive Learning against Backdoor Attacks [58.824074124014224]
We show how contrastive backdoor attacks operate through distinctive mechanisms.
Our findings highlight the need for defenses tailored to the specificities of contrastive backdoor attacks.
arXiv Detail & Related papers (2023-12-14T15:54:52Z)
- Investigating Human-Identifiable Features Hidden in Adversarial Perturbations [54.39726653562144]
Our study explores up to five attack algorithms across three datasets.
We identify human-identifiable features in adversarial perturbations.
Using pixel-level annotations, we extract such features and demonstrate their ability to compromise target models.
arXiv Detail & Related papers (2023-09-28T22:31:29Z)
- Analyzing the Impact of Adversarial Examples on Explainable Machine Learning [0.31498833540989407]
Adversarial attacks are a type of attack on machine learning models where an attacker deliberately modifies the inputs to cause the model to make incorrect predictions.
Work on the vulnerability of deep learning models to adversarial attacks has shown that it is easy to craft samples that push a model into predictions it should not make.
In this work, we analyze the impact of adversarial attacks on model interpretability for text classification problems.
arXiv Detail & Related papers (2023-07-17T08:50:36Z)
- Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial Perturbations against Interpretable Deep Learning [16.13790238416691]
This work introduces two attacks, AdvEdge and AdvEdge$+$, that deceive both the target deep learning model and the coupled interpretation model.
Our analysis shows the effectiveness of our attacks in terms of deceiving the deep learning models and their interpreters.
arXiv Detail & Related papers (2022-11-29T04:45:10Z)
- Backdooring Explainable Machine Learning [0.8180960351554997]
We demonstrate blinding attacks that can fully disguise an ongoing attack against the machine learning model.
Similar to neural backdoors, we modify the model's prediction upon trigger presence but simultaneously also fool the provided explanation.
arXiv Detail & Related papers (2022-04-20T14:40:09Z)
- Are socially-aware trajectory prediction models really socially-aware? [75.36961426916639]
We introduce a socially-attended attack to assess the social understanding of prediction models.
An attack is a small yet carefully crafted perturbation designed to make predictors fail.
We show that our attack can be employed to increase the social understanding of state-of-the-art models.
arXiv Detail & Related papers (2021-08-24T17:59:09Z)
- Attack to Fool and Explain Deep Networks [59.97135687719244]
We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations.
Our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models.
arXiv Detail & Related papers (2021-06-20T03:07:36Z)
- Proper Network Interpretability Helps Adversarial Robustness in Classification [91.39031895064223]
We show that with a proper measurement of interpretation, it is difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy.
We develop an interpretability-aware defensive scheme built only on promoting robust interpretation.
We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation.
arXiv Detail & Related papers (2020-06-26T01:31:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.