Smoothed Geometry for Robust Attribution
- URL: http://arxiv.org/abs/2006.06643v2
- Date: Thu, 22 Oct 2020 17:55:35 GMT
- Title: Smoothed Geometry for Robust Attribution
- Authors: Zifan Wang, Haofan Wang, Shakul Ramkumar, Matt Fredrikson, Piotr
Mardziel and Anupam Datta
- Abstract summary: Feature attributions are a popular tool for explaining the behavior of Deep Neural Networks (DNNs).
They have been shown to be vulnerable to attacks that produce divergent explanations for nearby inputs.
This lack of robustness is especially problematic in high-stakes applications where adversarially-manipulated explanations could impair safety and trustworthiness.
- Score: 36.616902063693104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature attributions are a popular tool for explaining the behavior of Deep
Neural Networks (DNNs), but have recently been shown to be vulnerable to
attacks that produce divergent explanations for nearby inputs. This lack of
robustness is especially problematic in high-stakes applications where
adversarially-manipulated explanations could impair safety and trustworthiness.
Building on a geometric understanding of these attacks presented in recent
work, we identify Lipschitz continuity conditions on models' gradient that lead
to robust gradient-based attributions, and observe that smoothness may also be
related to the ability of an attack to transfer across multiple attribution
methods. To mitigate these attacks in practice, we propose an inexpensive
regularization method that promotes these conditions in DNNs, as well as a
stochastic smoothing technique that does not require re-training. Our
experiments on a range of image models demonstrate that both of these
mitigations consistently improve attribution robustness, and confirm the role
that smooth geometry plays in these attacks on real, large-scale models.
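The two mitigations can be pictured concretely. Below is a minimal sketch, assuming a PyTorch image classifier `model` and a single-image batch `x`, of (a) SmoothGrad-style stochastic smoothing of a gradient attribution, which needs no retraining, and (b) one plausible surrogate penalty that encourages gradient smoothness during training. The noise scale, sample count, and exact penalty form are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch only: `model`, `sigma`, `n_samples`, and `eps` are
# illustrative assumptions, not the paper's exact configuration.
import torch

def smoothed_saliency(model, x, target, sigma=0.1, n_samples=32):
    """Stochastic smoothing: average input gradients over Gaussian
    perturbations of x (applies to an already-trained model).
    Assumes x has shape [1, C, H, W]."""
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, target]            # scalar class score
        grad, = torch.autograd.grad(score, noisy)  # input gradient at noisy point
        total += grad
    return total / n_samples

def gradient_smoothness_penalty(model, x, target, eps=0.05):
    """One plausible training-time surrogate: discourage the input gradient
    from changing sharply between x and a nearby random point."""
    def input_grad(inp):
        inp = inp.detach().clone().requires_grad_(True)
        score = model(inp)[0, target]
        g, = torch.autograd.grad(score, inp, create_graph=True)
        return g
    x_near = x + eps * torch.randn_like(x)
    return (input_grad(x) - input_grad(x_near)).norm()
```

Adding the penalty, weighted by a small coefficient, to the usual classification loss is one way to promote the Lipschitz-style gradient conditions discussed above, while the smoothed saliency map can be computed for a model as-is.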
Related papers
- Improving Adversarial Robustness via Feature Pattern Consistency Constraint [42.50500608175905]
Convolutional Neural Networks (CNNs) are well-known for their vulnerability to adversarial attacks, posing significant security concerns.
Most existing methods either focus on learning from adversarial perturbations, leading to overfitting to the adversarial examples, or aim to eliminate such perturbations during inference.
We introduce a novel and effective Feature Pattern Consistency Constraint (FPCC) method to reinforce the latent feature's capacity to maintain the correct feature pattern.
arXiv Detail & Related papers (2024-06-13T05:38:30Z)
- BadGD: A unified data-centric framework to identify gradient descent vulnerabilities [10.996626204702189]
BadGD sets a new standard for understanding and mitigating adversarial manipulations.
This research underscores the severe threats posed by such data-centric attacks and highlights the urgent need for robust defenses in machine learning.
arXiv Detail & Related papers (2024-05-24T23:39:45Z)
- Extreme Miscalibration and the Illusion of Adversarial Robustness [66.29268991629085]
Adversarial Training is often used to increase model robustness.
We show that this observed gain in robustness is an illusion of robustness (IOR).
We urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations (a minimal temperature-scaling sketch appears after this list).
arXiv Detail & Related papers (2024-02-27T13:49:12Z)
- Improving Adversarial Robustness to Sensitivity and Invariance Attacks with Deep Metric Learning [80.21709045433096]
A standard method in adversarial robustness assumes a framework for defending against samples crafted by minimally perturbing a clean input.
We use metric learning to frame adversarial regularization as an optimal transport problem.
Our preliminary results indicate that regularizing over invariant perturbations in our framework improves both invariant and sensitivity defense.
arXiv Detail & Related papers (2022-11-04T13:54:02Z)
- What Does the Gradient Tell When Attacking the Graph Structure [44.44204591087092]
We present a theoretical demonstration revealing that attackers tend to increase inter-class edges due to the message passing mechanism of GNNs.
By connecting dissimilar nodes, attackers can more effectively corrupt node features, making such attacks more advantageous.
We propose an innovative attack loss that balances attack effectiveness and imperceptibility, sacrificing some attack effectiveness to attain greater imperceptibility.
arXiv Detail & Related papers (2022-08-26T15:45:20Z)
- Threat Model-Agnostic Adversarial Defense using Diffusion Models [14.603209216642034]
Deep Neural Networks (DNNs) are highly sensitive to imperceptible malicious perturbations, known as adversarial attacks.
arXiv Detail & Related papers (2022-07-17T06:50:48Z)
- Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z)
- Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
- Adaptive Feature Alignment for Adversarial Training [56.17654691470554]
CNNs are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications.
We propose the adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths.
Our method is trained to automatically align features of arbitrary attacking strength.
arXiv Detail & Related papers (2021-05-31T17:01:05Z)
- Detection Defense Against Adversarial Attacks with Saliency Map [7.736844355705379]
It is well established that neural networks are vulnerable to adversarial examples, which are almost imperceptible to human vision.
Existing defenses tend to harden the robustness of models against adversarial attacks.
We propose a novel method that combines additional noise with an inconsistency strategy to detect adversarial examples.
arXiv Detail & Related papers (2020-09-06T13:57:17Z)
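As a companion to the "Extreme Miscalibration and the Illusion of Adversarial Robustness" entry above, here is a minimal sketch of test-time temperature scaling. The fitting procedure, optimizer settings, and variable names are illustrative assumptions rather than that paper's exact evaluation protocol.

```python
# Hedged sketch of test-time temperature scaling: fit a single temperature
# on held-out logits/labels, then rescale logits before softmax at test time.
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Minimize NLL on held-out data over a single positive temperature.
    val_logits: [N, C] float tensor; val_labels: [N] long tensor."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log(T) so T > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

def calibrated_probs(test_logits, temperature):
    """Apply the fitted temperature before softmax at evaluation time."""
    return F.softmax(test_logits / temperature, dim=-1)
```

Reporting robustness metrics on the temperature-scaled probabilities rather than the raw ones is the kind of test-time check that entry advocates.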
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.