Second-Order NLP Adversarial Examples
- URL: http://arxiv.org/abs/2010.01770v2
- Date: Tue, 6 Oct 2020 01:20:53 GMT
- Title: Second-Order NLP Adversarial Examples
- Authors: John X. Morris
- Abstract summary: Adversarial example generation methods rely on models like language models or sentence encoders to determine if potential adversarial examples are valid.
In these methods, a valid adversarial example fools the model being attacked, and is determined to be semantically or syntactically valid by a second model.
We contend that these adversarial examples may not be flaws in the attacked model, but flaws in the model that determines validity.
- Score: 0.18855270809505867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial example generation methods in NLP rely on models like language
models or sentence encoders to determine if potential adversarial examples are
valid. In these methods, a valid adversarial example fools the model being
attacked, and is determined to be semantically or syntactically valid by a
second model. Research to date has counted all such examples as errors by the
attacked model. We contend that these adversarial examples may not be flaws in
the attacked model, but flaws in the model that determines validity. We term
such invalid inputs second-order adversarial examples. We propose the
constraint robustness curve and associated metric ACCS as tools for evaluating
the robustness of a constraint to second-order adversarial examples. To
generate this curve, we design an adversarial attack to run directly on the
semantic similarity models. We test on two constraints, the Universal Sentence
Encoder (USE) and BERTScore. Our findings indicate that such second-order
examples exist, but are typically less common than first-order adversarial
examples in state-of-the-art models. They also indicate that USE is effective
as a constraint on NLP adversarial examples, while BERTScore is nearly
ineffectual. Code for running the experiments in this paper is available at
https://github.com/jxmorris12/second-order-adversarial-examples.
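To make the role of the constraint model concrete, below is a minimal sketch of how a sentence-encoder constraint is typically applied to candidate adversarial examples. This is not the paper's implementation (that is in the linked repository); the threshold value and helper names are illustrative assumptions, and it uses the publicly available Universal Sentence Encoder from TensorFlow Hub.

```python
# Minimal sketch (not the paper's code): a USE-based validity constraint on
# candidate adversarial examples. The 0.8 threshold and function names are
# illustrative assumptions.
import numpy as np
import tensorflow_hub as hub

# Load the public Universal Sentence Encoder (v4) from TensorFlow Hub.
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def use_similarity(original: str, candidate: str) -> float:
    """Cosine similarity between the USE embeddings of two sentences."""
    emb = use_model([original, candidate]).numpy()
    a, b = emb[0], emb[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_constraint(original: str, candidate: str, threshold: float = 0.8) -> bool:
    """A candidate only counts as a valid (first-order) adversarial example
    if the constraint model judges it semantically close to the original."""
    return use_similarity(original, candidate) >= threshold

if __name__ == "__main__":
    print(passes_constraint(
        "The movie was a delightful surprise.",
        "The movie was a dreadful surprise."))
```

In the paper's terms, a second-order attack targets the similarity function itself: it searches for a meaning-changing edit that the encoder still scores above the threshold, or a benign paraphrase it scores below it, so the error lies with the constraint model rather than the attacked classifier.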
Related papers
- A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers [10.063169009242682]
We train an encoder-decoder paraphrase model to generate adversarial examples.
We adopt a reinforcement learning algorithm and propose a constraint-enforcing reward.
We show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
arXiv Detail & Related papers (2024-05-20T09:33:43Z)
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs which circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
- Latent Feature Relation Consistency for Adversarial Robustness [80.24334635105829]
Misclassification occurs when deep neural networks are fed adversarial examples, which add human-imperceptible adversarial noise to natural examples.
We propose Latent Feature Relation Consistency (LFRC).
LFRC constrains the relation of adversarial examples in latent space to be consistent with the natural examples.
arXiv Detail & Related papers (2023-03-29T13:50:01Z)
- On the Effect of Adversarial Training Against Invariance-based Adversarial Examples [0.23624125155742057]
This work addresses the impact of adversarial training with invariance-based adversarial examples on a convolutional neural network (CNN).
We show that when adversarial training with invariance-based and perturbation-based adversarial examples is applied, it should be conducted simultaneously and not consecutively.
arXiv Detail & Related papers (2023-02-16T12:35:37Z)
- Unrestricted Adversarial Samples Based on Non-semantic Feature Clusters Substitution [1.8782750537161608]
We introduce "unrestricted" perturbations that create adversarial samples by using spurious relations learned by model training.
Specifically, we find feature clusters in non-semantic features that are strongly correlated with model judgment results.
We create adversarial samples by using them to replace the corresponding feature clusters in the target image.
arXiv Detail & Related papers (2022-08-31T07:42:36Z)
- ADC: Adversarial attacks against object Detection that evade Context consistency checks [55.8459119462263]
We show that even context consistency checks can be brittle to properly crafted adversarial examples.
We propose an adaptive framework to generate examples that subvert such defenses.
Our results suggest that how to robustly model context and check its consistency is still an open problem.
arXiv Detail & Related papers (2021-10-24T00:25:09Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Differentiable Language Model Adversarial Attacks on Categorical Sequence Classifiers [0.0]
An adversarial attack paradigm explores various scenarios for the vulnerability of deep learning models.
We fine-tune a language model to act as a generator of adversarial examples.
Our model works for diverse datasets on bank transactions, electronic health records, and NLP datasets.
arXiv Detail & Related papers (2020-06-19T11:25:36Z)
- Robust and On-the-fly Dataset Denoising for Image Classification [72.10311040730815]
On-the-fly Data Denoising (ODD) is robust to mislabeled examples, while introducing almost zero computational overhead compared to standard training.
ODD is able to achieve state-of-the-art results on a wide range of datasets including real-world ones such as WebVision and Clothing1M.
arXiv Detail & Related papers (2020-03-24T03:59:26Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
- Generating Natural Adversarial Hyperspectral examples with a modified Wasserstein GAN [0.0]
We present a new method that generates natural adversarial examples from the true data, following the second paradigm.
We provide a proof of concept of our method by generating adversarial hyperspectral signatures on a remote sensing dataset.
arXiv Detail & Related papers (2020-01-27T07:32:46Z)