Concept-based Adversarial Attacks: Tricking Humans and Classifiers Alike
- URL: http://arxiv.org/abs/2203.10166v1
- Date: Fri, 18 Mar 2022 21:30:11 GMT
- Title: Concept-based Adversarial Attacks: Tricking Humans and Classifiers Alike
- Authors: Johannes Schneider and Giovanni Apruzzese
- Abstract summary: We generate adversarial samples by modifying activations of upper layers encoding semantically meaningful concepts.
A human might (and possibly should) notice differences between the original and the adversarial sample.
Our approach is relevant in, e.g., multi-stage processing of inputs, where both humans and machines are involved in decision-making.
- Score: 4.578929995816155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to generate adversarial samples by modifying activations of upper
layers encoding semantically meaningful concepts. The original sample is
shifted towards a target sample, yielding an adversarial sample, by using the
modified activations to reconstruct the original sample. A human might (and
possibly should) notice differences between the original and the adversarial
sample. Depending on the attacker-provided constraints, an adversarial sample
can exhibit subtle differences or appear like a "forged" sample from another
class. Our approach and goal are in stark contrast to common attacks involving
perturbations of single pixels that are not recognizable by humans. Our
approach is relevant in, e.g., multi-stage processing of inputs, where both
humans and machines are involved in decision-making, because invisible
perturbations will not fool a human. Our evaluation focuses on deep neural
networks. We also show the transferability of our adversarial examples among
networks.
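As an illustration (not code from the paper), here is a minimal sketch of the activation-shifting idea, assuming a PyTorch `encoder` that maps an input to upper-layer activations and a trained `decoder` that reconstructs an image from them; both networks and the mixing weight `alpha` are hypothetical placeholders.

```python
import torch

def concept_shift_attack(encoder, decoder, x_orig, x_target, alpha=0.3):
    """Shift upper-layer activations of x_orig toward those of x_target,
    then reconstruct an adversarial sample from the mixed activations."""
    with torch.no_grad():
        a_orig = encoder(x_orig)      # activations encoding concepts
        a_target = encoder(x_target)
        # Interpolate at the concept level, not the pixel level.
        a_adv = (1 - alpha) * a_orig + alpha * a_target
        return decoder(a_adv)         # reconstructed adversarial sample
```

Varying `alpha` would trade off between subtle differences and an outright "forged" sample, mirroring the attacker-provided constraints described above.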
Related papers
- Imperceptible Face Forgery Attack via Adversarial Semantic Mask [59.23247545399068]
We propose an Adversarial Semantic Mask Attack framework (ASMA) that can generate adversarial examples with good transferability and invisibility.
Specifically, we propose a novel adversarial semantic mask generative model that constrains the generated perturbations to local semantic regions for good stealthiness.
arXiv Detail & Related papers (2024-06-16T10:38:11Z)
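A minimal sketch of the masked-perturbation constraint behind ASMA; the paper's generative model is not reproduced here, a single FGSM step stands in for the learned perturbation, and `model` and `semantic_mask` are assumed inputs.

```python
import torch

def masked_fgsm(model, x, y, semantic_mask, eps=0.03):
    """One FGSM step whose perturbation is confined to a semantic region.

    x: input image batch, y: true labels,
    semantic_mask: 0/1 tensor marking the local semantic region.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Keep only the gradient signal inside the semantic region.
    perturbation = eps * x.grad.sign() * semantic_mask
    return (x + perturbation).clamp(0, 1).detach()
```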
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier either by concealing their writing style or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models! [52.0855711767075]
EvoSeed is an evolutionary strategy-based algorithmic framework for generating photo-realistic natural adversarial samples.
We employ CMA-ES to optimize the search for an initial seed vector which, when processed by the Conditional Diffusion Model, yields a natural adversarial sample that is misclassified by the victim model.
Experiments show that the generated adversarial images are of high quality, raising concerns that harmful content can be generated while bypassing safety classifiers.
arXiv Detail & Related papers (2024-02-07T09:39:29Z)
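A minimal sketch of the EvoSeed-style search loop, using the `cma` package for CMA-ES; `diffusion_generate` (a conditional diffusion model) and `victim_logits` (the attacked classifier) are hypothetical placeholder callables.

```python
import numpy as np
import cma

def adversarial_seed_search(diffusion_generate, victim_logits, true_class,
                            dim=128, sigma=0.5, iters=50):
    # CMA-ES minimizes fitness: here, the victim's confidence in the
    # true class of the image generated from each candidate seed.
    es = cma.CMAEvolutionStrategy(dim * [0.0], sigma)
    for _ in range(iters):
        seeds = es.ask()
        fitness = [float(victim_logits(diffusion_generate(np.asarray(s)))[true_class])
                   for s in seeds]
        es.tell(seeds, fitness)
    return np.asarray(es.result.xbest)  # best (most adversarial) seed found
```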
- On the Effect of Adversarial Training Against Invariance-based Adversarial Examples [0.23624125155742057]
This work addresses the impact of adversarial training with invariance-based adversarial examples on a convolutional neural network (CNN).
We show that when adversarial training with invariance-based and perturbation-based adversarial examples is applied, it should be conducted simultaneously and not consecutively.
arXiv Detail & Related papers (2023-02-16T12:35:37Z)
- Inference Time Evidences of Adversarial Attacks for Forensic on Transformers [27.88746727644074]
Vision Transformers (ViTs) are becoming a popular paradigm for vision tasks as they achieve state-of-the-art performance on image classification.
This paper presents our first attempt toward detecting adversarial attacks during inference time, using the network's inputs and outputs as well as its latent features.
arXiv Detail & Related papers (2023-01-31T01:17:03Z)
- Pixle: a fast and effective black-box attack based on rearranging pixels [15.705568893476947]
Black-box adversarial attacks can be performed without knowing the inner structure of the attacked model.
We propose a novel attack that is capable of correctly attacking a high percentage of samples by rearranging a small number of pixels within the attacked image.
We demonstrate that our attack works on a large number of datasets and models, that it requires a small number of iterations, and that the difference between the original sample and the adversarial one is imperceptible to the human eye.
arXiv Detail & Related papers (2022-02-04T17:03:32Z)
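A minimal sketch of the pixel-rearranging idea as a random search (the actual Pixle algorithm differs in how it samples and maps pixels); `predict_proba` stands in for black-box access to the victim model, and the image is assumed to be an (H, W, C) array.

```python
import numpy as np

def pixle_like_attack(predict_proba, image, true_class,
                      n_iters=200, n_pixels=4, rng=None):
    """Randomly swap pixel pairs, keeping swaps that reduce the model's
    confidence in the true class. Purely black-box: no gradients used."""
    rng = rng or np.random.default_rng(0)
    adv = image.copy()
    best = predict_proba(adv)[true_class]
    h, w = adv.shape[:2]
    for _ in range(n_iters):
        cand = adv.copy()
        for _ in range(n_pixels):
            (y1, x1), (y2, x2) = rng.integers(0, (h, w), size=(2, 2))
            cand[y1, x1], cand[y2, x2] = cand[y2, x2].copy(), cand[y1, x1].copy()
        p = predict_proba(cand)[true_class]
        if p < best:            # keep the swap only if it helps
            adv, best = cand, p
    return adv
```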
- Identification of Attack-Specific Signatures in Adversarial Examples [62.17639067715379]
We show that different attack algorithms produce adversarial examples which are distinct not only in their effectiveness but also in how they qualitatively affect their victims.
Our findings suggest that prospective adversarial attacks should be compared not only via their success rates at fooling models but also via the deeper downstream effects they have on their victims.
arXiv Detail & Related papers (2021-10-13T15:40:48Z)
- Normal vs. Adversarial: Salience-based Analysis of Adversarial Samples for Relation Extraction [25.869746965410954]
We take a first step toward leveraging salience-based methods to analyze adversarial samples.
We observe that salient tokens correlate directly with adversarial perturbations.
To some extent, our approach unveils the characteristics of adversarial samples.
arXiv Detail & Related papers (2021-04-01T07:36:04Z)
- Learning to Separate Clusters of Adversarial Representations for Robust Adversarial Detection [50.03939695025513]
We propose a new probabilistic adversarial detector motivated by the recently introduced notion of non-robust features.
In this paper, we consider non-robust features as a common property of adversarial examples, and we deduce that it is possible to find a cluster in representation space corresponding to this property.
This idea leads us to estimate the probability distribution of adversarial representations in a separate cluster, and to leverage this distribution for a likelihood-based adversarial detector.
arXiv Detail & Related papers (2020-12-07T07:21:18Z)
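A minimal sketch of the cluster/likelihood idea above: fit one Gaussian per cluster of representations and flag inputs whose representation is explained better by the adversarial cluster. Representation extraction and the two labeled representation sets are assumed given; this is not the paper's exact estimator.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_likelihood_detector(clean_reps, adv_reps):
    """clean_reps, adv_reps: (n_samples, dim) arrays of representations."""
    # One Gaussian per cluster in representation space.
    g_clean = multivariate_normal(clean_reps.mean(axis=0),
                                  np.cov(clean_reps.T), allow_singular=True)
    g_adv = multivariate_normal(adv_reps.mean(axis=0),
                                np.cov(adv_reps.T), allow_singular=True)

    def is_adversarial(rep):
        # Flag when the adversarial cluster has the higher likelihood.
        return g_adv.logpdf(rep) > g_clean.logpdf(rep)

    return is_adversarial
```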
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
From these adversarial examples, we derive word replacement rules that can be used for model diagnostics.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
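A minimal sketch of the genetic-algorithm idea: evolve a bit-mask over candidate source models, scoring each subset by how well adversarial examples crafted on it transfer; `transfer_rate` is a hypothetical scoring function (craft examples on the subset, measure the fooling rate on held-out models).

```python
import random

def evolve_ensemble(n_models, transfer_rate, pop_size=20, gens=30,
                    mutate_p=0.1, seed=0):
    """Each individual is a bool list marking which models join the ensemble."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_models)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=transfer_rate, reverse=True)
        parents = pop[: pop_size // 2]          # keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_models)    # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([not g if rng.random() < mutate_p else g
                             for g in child])   # bit-flip mutation
        pop = parents + children
    return max(pop, key=transfer_rate)          # best model subset found
```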
- AdvJND: Generating Adversarial Examples with Just Noticeable Difference [3.638233924421642]
Adding small perturbations to examples causes a well-performing model to misclassify the crafted examples.
Adversarial examples generated by our AdvJND algorithm yield distributions similar to those of the original inputs.
arXiv Detail & Related papers (2020-02-01T09:55:27Z)
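A minimal sketch of the just-noticeable-difference idea behind AdvJND: bound the perturbation per pixel by a JND map rather than a uniform epsilon, so the adversarial image stays perceptually close to the original. The luminance-based JND formula below is a simplified placeholder, not the paper's model.

```python
import numpy as np

def luminance_jnd(image):
    # Simplified JND map: very dark or bright regions tolerate larger
    # changes than mid-gray ones. Image values assumed in [0, 1].
    return 0.02 + 0.08 * np.abs(image - 0.5)

def apply_jnd_bound(image, perturbation):
    jnd = luminance_jnd(image)
    bounded = np.clip(perturbation, -jnd, jnd)  # per-pixel budget
    return np.clip(image + bounded, 0.0, 1.0)
```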
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.