Robust Feature-Level Adversaries are Interpretability Tools
- URL: http://arxiv.org/abs/2110.03605v7
- Date: Mon, 11 Sep 2023 16:31:55 GMT
- Title: Robust Feature-Level Adversaries are Interpretability Tools
- Authors: Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
- Abstract summary: Recent work that manipulates latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks.
We show that these adversaries are uniquely versatile and highly robust.
They can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.
- Score: 17.72884349429452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The literature on adversarial attacks in computer vision typically focuses on
pixel-level perturbations. These tend to be very difficult to interpret. Recent
work that manipulates the latent representations of image generators to create
"feature-level" adversarial perturbations gives us an opportunity to explore
perceptible, interpretable adversarial attacks. We make three contributions.
First, we observe that feature-level attacks provide useful classes of inputs
for studying representations in models. Second, we show that these adversaries
are uniquely versatile and highly robust. We demonstrate that they can be used
to produce targeted, universal, disguised, physically-realizable, and black-box
attacks at the ImageNet scale. Third, we show how these adversarial images can
be used as a practical interpretability tool for identifying bugs in networks.
We use these adversaries to make predictions about spurious associations
between features and classes which we then test by designing "copy/paste"
attacks in which one natural image is pasted into another to cause a targeted
misclassification. Our results suggest that feature-level attacks are a
promising approach for rigorous interpretability research. They support the
design of tools to better understand what a model has learned and diagnose
brittle feature associations. Code is available at
https://github.com/thestephencasper/feature_level_adv
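As a rough illustration of the approach described above, the following is a minimal PyTorch-style sketch of a targeted feature-level attack: a perturbation is optimized in the latent space of an image generator so that the decoded image is misclassified as a chosen target class. The `generator`, `classifier`, loss, and hyperparameters here are illustrative assumptions, not the paper's exact implementation; see the linked repository for the authors' code.

```python
# Minimal sketch of a targeted feature-level attack, assuming a pretrained
# PyTorch image generator and classifier; all names and hyperparameters are
# illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F

def feature_level_attack(generator, classifier, z, target_class,
                         steps=200, lr=0.05, eps=2.0):
    """Optimize a perturbation of the latent code `z` so that the decoded
    image is classified as `target_class`."""
    delta = torch.zeros_like(z, requires_grad=True)  # latent-space perturbation
    opt = torch.optim.Adam([delta], lr=lr)
    targets = torch.full((z.shape[0],), target_class, dtype=torch.long,
                         device=z.device)
    for _ in range(steps):
        x_adv = generator(z + delta)                 # decode perturbed latents
        loss = F.cross_entropy(classifier(x_adv), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                        # keep the perturbation bounded
            delta.clamp_(-eps, eps)
    return generator(z + delta).detach()             # adversarial image(s)
```

In the same spirit, the "copy/paste" attacks mentioned in the abstract can be tested by pasting a natural image patch suggested by such an adversary into a benign image and checking whether the classifier's prediction flips to the target class.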
Related papers
- Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection [83.72430401516674]
GAKer is able to construct adversarial examples for any target class.
Our method achieves an approximately 14.13% higher attack success rate for unknown classes.
arXiv Detail & Related papers (2024-07-17T03:24:09Z)
- Improving Adversarial Robustness via Decoupled Visual Representation Masking [65.73203518658224]
In this paper, we highlight two novel properties of robust features from the feature distribution perspective.
We find that state-of-the-art defense methods aim to address both of these issues.
Specifically, we propose a simple but effective defense based on decoupled visual representation masking.
arXiv Detail & Related papers (2024-06-16T13:29:41Z)
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Counterfactual Image Generation for adversarially robust and interpretable Classifiers [1.3859669037499769]
We propose a unified framework leveraging image-to-image translation Generative Adversarial Networks (GANs) to produce counterfactual samples.
This is achieved by combining the classifier and discriminator into a single model that attributes real images to their respective classes and flags generated images as "fake".
We show how the model exhibits improved robustness to adversarial attacks, and we show how the discriminator's "fakeness" value serves as an uncertainty measure of the predictions.
arXiv Detail & Related papers (2023-10-01T18:50:29Z)
- Investigating Human-Identifiable Features Hidden in Adversarial Perturbations [54.39726653562144]
Our study explores up to five attack algorithms across three datasets.
We identify human-identifiable features in adversarial perturbations.
Using pixel-level annotations, we extract such features and demonstrate their ability to compromise target models.
arXiv Detail & Related papers (2023-09-28T22:31:29Z)
- Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks.
We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering.
Our evaluations on the miniImagenet (MI) and CUB datasets exhibit good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z)
- Adversarial examples by perturbing high-level features in intermediate decoder layers [0.0]
Instead of perturbing pixels, we use an encoder-decoder representation of the input image and perturb intermediate layers in the decoder.
Our perturbation possesses semantic meaning, such as a longer beak or green tints.
We show that our method modifies key features such as edges and that defence techniques based on adversarial training are vulnerable to our attacks.
arXiv Detail & Related papers (2021-10-14T07:08:15Z)
- Attack to Fool and Explain Deep Networks [59.97135687719244]
We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations.
Our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models.
arXiv Detail & Related papers (2021-06-20T03:07:36Z)
- AdvFlow: Inconspicuous Black-box Adversarial Attacks using Normalizing Flows [11.510009152620666]
We introduce AdvFlow: a novel black-box adversarial attack method on image classifiers.
We see that the proposed method generates adversaries that closely follow the clean data distribution, a property which makes their detection less likely.
arXiv Detail & Related papers (2020-07-15T02:13:49Z)
- Generating Semantic Adversarial Examples via Feature Manipulation [23.48763375455514]
We propose a more practical adversarial attack by designing structured perturbation with semantic meanings.
Our proposed technique manipulates the semantic attributes of images via the disentangled latent codes.
We demonstrate the existence of a universal, image-agnostic semantic adversarial example.
arXiv Detail & Related papers (2020-01-06T06:28:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.