Proper Network Interpretability Helps Adversarial Robustness in
Classification
- URL: http://arxiv.org/abs/2006.14748v2
- Date: Wed, 21 Oct 2020 18:56:05 GMT
- Title: Proper Network Interpretability Helps Adversarial Robustness in
Classification
- Authors: Akhilan Boopathy, Sijia Liu, Gaoyuan Zhang, Cynthia Liu, Pin-Yu Chen,
Shiyu Chang, Luca Daniel
- Abstract summary: We show that with a proper measurement of interpretation, it is difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy.
We develop an interpretability-aware defensive scheme built only on promoting robust interpretation.
We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against large-perturbation attacks.
- Score: 91.39031895064223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have empirically shown that there exist adversarial examples
that can be hidden from neural network interpretability (namely, making network
interpretation maps visually similar), or interpretability is itself
susceptible to adversarial attacks. In this paper, we theoretically show that
with a proper measurement of interpretation, it is actually difficult to
prevent prediction-evasion adversarial attacks from causing interpretation
discrepancy, as confirmed by experiments on MNIST, CIFAR-10 and Restricted
ImageNet. Spurred by that, we develop an interpretability-aware defensive
scheme built only on promoting robust interpretation (without the need for
resorting to adversarial loss minimization). We show that our defense achieves
both robust classification and robust interpretation, outperforming
state-of-the-art adversarial training methods, particularly against
large-perturbation attacks.
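
As a concrete reading of the defensive scheme, here is a minimal sketch in PyTorch, assuming input-gradient saliency as the interpretation map and an l1 interpretation-discrepancy penalty; the perturbation model, function names, and hyperparameters are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch only: the interpretation map (input-gradient saliency)
# and the l1 discrepancy measure are simplifying assumptions, not the
# paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def saliency(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """One simple interpretation map I(x): gradient of the loss w.r.t. the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x, create_graph=True)
    return grad

def interp_robust_loss(model, x, y, eps=8 / 255, lam=1.0):
    # A random l_inf perturbation stands in for the worst case; no adversarial
    # loss minimization is performed, matching the abstract's claim that the
    # defense is built only on promoting robust interpretation.
    delta = torch.empty_like(x).uniform_(-eps, eps)
    discrepancy = (saliency(model, x, y) - saliency(model, x + delta, y)).abs().mean()
    return F.cross_entropy(model(x), y) + lam * discrepancy

# Toy usage on random data with a small linear classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = interp_robust_loss(model, x, y)
loss.backward()  # gradients flow through both terms
```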
Related papers
- Detecting Adversarial Attacks in Semantic Segmentation via Uncertainty Estimation: A Deep Analysis [12.133306321357999]
We propose an uncertainty-based method for detecting adversarial attacks on neural networks for semantic segmentation.
We conduct a detailed analysis of uncertainty-based detection of adversarial attacks across various state-of-the-art neural networks.
Our numerical experiments show the effectiveness of the proposed uncertainty-based detection method.
arXiv Detail & Related papers (2024-08-19T14:13:30Z)
- Uncertainty-based Detection of Adversarial Attacks in Semantic Segmentation [16.109860499330562]
We introduce an uncertainty-based approach for the detection of adversarial attacks in semantic segmentation.
We demonstrate the ability of our approach to detect perturbed images across multiple types of adversarial attacks; a simple sketch of the underlying idea follows this entry.
arXiv Detail & Related papers (2023-05-22T08:36:35Z)
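
A minimal sketch of the uncertainty-based detection idea in the two entries above, assuming pixel-wise softmax entropy as the uncertainty measure and a simple mean-entropy threshold; both papers study richer uncertainty statistics and detectors.

```python
# Hypothetical sketch: flag an image as adversarial when its mean pixel-wise
# softmax entropy exceeds a threshold calibrated on clean validation data.
import torch
import torch.nn.functional as F

def mean_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: (B, C, H, W) segmentation outputs -> mean pixel entropy per image."""
    p = F.softmax(logits, dim=1)
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=1)  # (B, H, W)
    return entropy.mean(dim=(1, 2))  # (B,)

def flag_adversarial(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    # Adversarial inputs tend to push the network into higher-uncertainty
    # regions of the input space.
    return mean_entropy(logits) > threshold

# Toy usage: random "segmentation logits" for 2 images, 19 classes.
logits = torch.randn(2, 19, 64, 64)
print(flag_adversarial(logits, threshold=2.5))
```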
- Interpretability is a Kind of Safety: An Interpreter-based Ensemble for Adversary Defense [28.398901783858005]
We propose an interpreter-based ensemble framework called X-Ensemble for robust defense against adversaries.
X-Ensemble employs a Random Forest (RF) model to combine sub-detectors into an ensemble detector against hybrid adversarial attacks (sketched below).
arXiv Detail & Related papers (2023-04-14T04:32:06Z)
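
A hedged sketch of the ensemble-combination step named above; the sub-detector scores here are synthetic stand-ins, since X-Ensemble's actual sub-detectors operate on interpretation maps.

```python
# Illustrative only: several sub-detectors each score an input, and a Random
# Forest combines the scores into one adversarial/benign decision.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Pretend each of 3 sub-detectors emits one score per example.
scores_benign = rng.normal(0.0, 1.0, size=(200, 3))
scores_adv = rng.normal(1.5, 1.0, size=(200, 3))
X = np.vstack([scores_benign, scores_adv])
y = np.array([0] * 200 + [1] * 200)  # 0 = benign, 1 = adversarial

ensemble = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("train accuracy:", ensemble.score(X, y))
```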
- Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples.
Yet its passive nature inevitably leaves it vulnerable to unknown attackers.
We provide a causal viewpoint on adversarial vulnerability: the cause is the confounder that ubiquitously exists in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)
- Combating Adversaries with Anti-Adversaries [118.70141983415445]
In particular, our layer generates an input perturbation in the opposite direction of the adversarial one.
We verify the effectiveness of our approach by combining our layer with both nominally and robustly trained models.
Our anti-adversary layer significantly enhances model robustness at no cost in clean accuracy; a minimal sketch of the idea follows this entry.
arXiv Detail & Related papers (2021-03-26T09:36:59Z)
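
A minimal sketch of the anti-adversary idea, assuming a signed-gradient update toward the model's own predicted label; the step size and iteration count are illustrative, not the paper's settings.

```python
# Before classification, nudge the input a few gradient steps that *decrease*
# the loss of the model's own predicted label, i.e. the opposite direction an
# adversary would take.
import torch
import torch.nn as nn
import torch.nn.functional as F

def anti_adversary(model: nn.Module, x: torch.Tensor,
                   steps: int = 2, alpha: float = 2 / 255) -> torch.Tensor:
    y_pred = model(x).argmax(dim=1)  # pseudo-label from the undefended model
    x_aa = x.clone().detach()
    for _ in range(steps):
        x_aa.requires_grad_(True)
        loss = F.cross_entropy(model(x_aa), y_pred)
        (grad,) = torch.autograd.grad(loss, x_aa)
        # Move against the loss gradient: reinforce the predicted class.
        x_aa = (x_aa - alpha * grad.sign()).detach().clamp(0, 1)
    return x_aa

# Usage: classify the anti-perturbed input instead of the raw one.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
logits = model(anti_adversary(model, x))
```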
- Resilience of Bayesian Layer-Wise Explanations under Adversarial Attacks [3.222802562733787]
We show that for deterministic neural networks, saliency interpretations are remarkably brittle even when the attacks fail.
We suggest and demonstrate empirically that saliency explanations provided by Bayesian neural networks are considerably more stable under adversarial perturbations (see the sketch below).
arXiv Detail & Related papers (2021-02-22T14:07:24Z)
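
One possible way to measure this stability claim empirically, assuming MC dropout as a cheap stand-in for a Bayesian network and cosine similarity between averaged saliency maps; this is an illustration, not the paper's protocol.

```python
# Average input gradients over stochastic forward passes (MC dropout) and
# compare the clean and perturbed saliency maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mc_saliency(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                samples: int = 8) -> torch.Tensor:
    model.train()  # keep dropout active at "test" time (MC dropout)
    maps = []
    for _ in range(samples):
        xg = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(xg), y)
        (g,) = torch.autograd.grad(loss, xg)
        maps.append(g)
    return torch.stack(maps).mean(dim=0)

model = nn.Sequential(nn.Flatten(), nn.Dropout(0.5), nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(2, 3, 32, 32), torch.randint(0, 10, (2,))
clean = mc_saliency(model, x, y).flatten(1)
pert = mc_saliency(model, x + 0.03 * torch.randn_like(x), y).flatten(1)
print(F.cosine_similarity(clean, pert, dim=1))  # closer to 1 = more stable
```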
- Learning to Separate Clusters of Adversarial Representations for Robust Adversarial Detection [50.03939695025513]
We propose a new probabilistic adversarial detector motivated by the recently introduced notion of non-robust features.
We consider non-robust features a common property of adversarial examples and deduce that it is possible to find a cluster in representation space corresponding to this property.
This leads us to estimate the probability distribution of adversarial representations as a separate cluster, and to leverage that distribution for a likelihood-based adversarial detector (sketched below).
arXiv Detail & Related papers (2020-12-07T07:21:18Z)
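
A hedged sketch of the likelihood-based detection idea, with synthetic feature vectors standing in for the learned representations used in the paper.

```python
# Fit a Gaussian to the representations of known adversarial examples and
# flag new inputs whose representation is too likely under that density.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
adv_feats = rng.normal(loc=2.0, scale=1.0, size=(500, 8))   # adversarial cluster
test_feats = np.vstack([rng.normal(0.0, 1.0, (5, 8)),       # clean-like
                        rng.normal(2.0, 1.0, (5, 8))])      # adversarial-like

# Maximum-likelihood Gaussian for the adversarial cluster.
mu, cov = adv_feats.mean(axis=0), np.cov(adv_feats, rowvar=False)
loglik = multivariate_normal(mean=mu, cov=cov).logpdf(test_feats)

is_adv = loglik > np.quantile(loglik, 0.5)  # threshold is illustrative
print(loglik.round(1), is_adv)
```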
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN)-based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
- Adversarial Attacks and Defenses: An Interpretation Perspective [80.23908920686625]
We review recent work on adversarial attacks and defenses, particularly from the perspective of machine learning interpretation.
The goal of model interpretation, or interpretable machine learning, is to extract human-understandable descriptions of how models work.
For each type of interpretation, we elaborate on how it could be used for adversarial attacks and defenses.
arXiv Detail & Related papers (2020-04-23T23:19:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.