Adversarial Attacks and Defenses: An Interpretation Perspective
- URL: http://arxiv.org/abs/2004.11488v2
- Date: Wed, 7 Oct 2020 15:43:26 GMT
- Title: Adversarial Attacks and Defenses: An Interpretation Perspective
- Authors: Ninghao Liu, Mengnan Du, Ruocheng Guo, Huan Liu, Xia Hu
- Abstract summary: We review recent work on adversarial attacks and defenses, particularly from the perspective of machine learning interpretation.
The goal of model interpretation, or interpretable machine learning, is to extract human-understandable terms for the working mechanism of models.
For each type of interpretation, we elaborate on how it could be used for adversarial attacks and defenses.
- Score: 80.23908920686625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent advances in a wide spectrum of applications, machine
learning models, especially deep neural networks, have been shown to be
vulnerable to adversarial attacks. Attackers add carefully-crafted
perturbations to input, where the perturbations are almost imperceptible to
humans, but can cause models to make wrong predictions. Techniques to protect
models against adversarial input are called adversarial defense methods.
Although many approaches have been proposed to study adversarial attacks and
defenses in different scenarios, an intriguing and crucial challenge remains:
how can we really understand model vulnerability? Inspired by the saying that
"if you know yourself and your enemy, you need not fear the battles," we may
tackle this challenge by interpreting machine learning models to open the
black boxes. The goal of model interpretation, or interpretable
machine learning, is to extract human-understandable terms for the working
mechanism of models. Recently, some approaches have started incorporating
interpretation into the exploration of adversarial attacks and defenses.
Meanwhile, we also observe that many existing methods of adversarial attacks
and defenses, although not explicitly claimed, can be understood from the
perspective of interpretation. In this paper, we review recent work on
adversarial attacks and defenses, particularly from the perspective of machine
learning interpretation. We categorize interpretation into two types,
feature-level interpretation and model-level interpretation. For each type of
interpretation, we elaborate on how it could be used for adversarial attacks
and defenses. We then briefly illustrate additional correlations between
interpretation and adversaries. Finally, we discuss the challenges and future
directions in tackling adversarial issues with interpretation.
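As a minimal illustration of the connection the abstract draws between attacks and feature-level interpretation, the sketch below (not code from the survey) pairs an FGSM-style perturbation with a gradient-saliency map; the toy linear model, random inputs, and epsilon value are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder classifier standing in for any differentiable model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

loss_fn = nn.CrossEntropyLoss()

def input_gradient(x, y):
    """Gradient of the classification loss with respect to the input pixels."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return x.grad.detach()

def fgsm_attack(x, y, eps=0.1):
    """FGSM-style attack: one step along the sign of the input gradient."""
    grad = input_gradient(x, y)
    return (x + eps * grad.sign()).clamp(0.0, 1.0)  # keep pixels in [0, 1]

def saliency_map(x, y):
    """Feature-level interpretation: per-pixel importance from |input gradient|."""
    return input_gradient(x, y).abs()

# Toy usage with random "images" standing in for a real dataset.
x = torch.rand(4, 1, 28, 28)
y = torch.randint(0, 10, (4,))

x_adv = fgsm_attack(x, y)
sal = saliency_map(x, y)
print("max perturbation:", (x_adv - x).abs().max().item())
print("saliency map shape:", tuple(sal.shape))
```

Both routines are built from the same input gradient, which is one concrete sense in which feature-level interpretation and adversarial perturbations probe the same signal.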
Related papers
- On the Difficulty of Defending Contrastive Learning against Backdoor Attacks [58.824074124014224]
We show how contrastive backdoor attacks operate through distinctive mechanisms.
Our findings highlight the need for defenses tailored to the specificities of contrastive backdoor attacks.
arXiv Detail & Related papers (2023-12-14T15:54:52Z)
- Investigating Human-Identifiable Features Hidden in Adversarial Perturbations [54.39726653562144]
Our study explores up to five attack algorithms across three datasets.
We identify human-identifiable features in adversarial perturbations.
Using pixel-level annotations, we extract such features and demonstrate their ability to compromise target models.
arXiv Detail & Related papers (2023-09-28T22:31:29Z)
- Analyzing the Impact of Adversarial Examples on Explainable Machine Learning [0.31498833540989407]
Adversarial attacks are a type of attack on machine learning models where an attacker deliberately modifies the inputs to cause the model to make incorrect predictions.
Work on the vulnerability of deep learning models to adversarial attacks has shown that it is easy to craft samples that push a model into predictions it should not make.
In this work, we analyze the impact of adversarial attacks on model interpretability for text classification problems.
arXiv Detail & Related papers (2023-07-17T08:50:36Z)
- Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial Perturbations against Interpretable Deep Learning [16.13790238416691]
This work introduces two attacks, AdvEdge and AdvEdge$+$, that deceive both the target deep learning model and the coupled interpretation model.
Our analysis shows the effectiveness of our attacks in terms of deceiving the deep learning models and their interpreters.
arXiv Detail & Related papers (2022-11-29T04:45:10Z)
- Backdooring Explainable Machine Learning [0.8180960351554997]
We demonstrate blinding attacks that can fully disguise an ongoing attack against the machine learning model.
Similar to neural backdoors, we modify the model's prediction upon trigger presence but simultaneously also fool the provided explanation.
arXiv Detail & Related papers (2022-04-20T14:40:09Z)
- Are socially-aware trajectory prediction models really socially-aware? [75.36961426916639]
We introduce a socially-attended attack to assess the social understanding of prediction models.
An attack is a small yet carefully crafted perturbation designed to make predictors fail.
We show that our attack can be employed to increase the social understanding of state-of-the-art models.
arXiv Detail & Related papers (2021-08-24T17:59:09Z)
- Attack to Fool and Explain Deep Networks [59.97135687719244]
We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations.
Our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models.
arXiv Detail & Related papers (2021-06-20T03:07:36Z)
- Proper Network Interpretability Helps Adversarial Robustness in Classification [91.39031895064223]
We show that with a proper measurement of interpretation, it is difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy.
We develop an interpretability-aware defensive scheme built only on promoting robust interpretation.
We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation.
arXiv Detail & Related papers (2020-06-26T01:31:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.