NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as
Artificial Adversaries?
- URL: http://arxiv.org/abs/2211.04364v1
- Date: Tue, 8 Nov 2022 16:37:34 GMT
- Title: NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as
Artificial Adversaries?
- Authors: Saadia Gabriel, Hamid Palangi, Yejin Choi
- Abstract summary: We introduce a two-stage adversarial example generation framework (NaturalAdversaries) for natural language understanding tasks.
It is adaptable to both black-box and white-box adversarial attacks, depending on the level of access to the model parameters.
Our results indicate that these adversaries generalize across domains and offer insights for future research on improving the robustness of neural text classification models.
- Score: 61.58261351116679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While a substantial body of prior work has explored adversarial example
generation for natural language understanding tasks, these examples are often
unrealistic and diverge from the real-world data distributions. In this work,
we introduce a two-stage adversarial example generation framework
(NaturalAdversaries) for designing adversaries that are effective at fooling a
given classifier and that demonstrate natural-looking failure cases which could
plausibly occur during in-the-wild deployment of the models.
In the first stage, a token attribution method summarizes a given classifier's
behaviour as a function of the key tokens in the input. In the second stage, a
generative model is conditioned on the key tokens identified in the first
stage. NaturalAdversaries is adaptable to both black-box and white-box
adversarial attacks, depending on the level of access to the model parameters.
Our results indicate that these adversaries generalize across domains and offer
insights for future research on improving the robustness of neural text
classification models.
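
The two-stage recipe above can be made concrete with a short sketch. This is a minimal illustration under assumptions, not the authors' released implementation: the toy classifier, the gradient-times-input attribution used for the white-box case, and the keyword-prompt conditioning are placeholders that only show how stage-one key tokens could feed a stage-two generator.

# Minimal sketch of the two-stage idea (white-box variant); all components
# below are illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """Toy bag-of-embeddings classifier standing in for the target model."""
    def __init__(self, vocab_size=1000, dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids):
        return self.head(self.embed(token_ids).mean(dim=0))

def white_box_key_tokens(model, token_ids, k=3):
    """Stage 1 (white-box): score each input token by the norm of
    gradient * embedding for the predicted class and keep the top k."""
    emb = model.embed(token_ids).detach().requires_grad_(True)
    logits = model.head(emb.mean(dim=0))
    logits[logits.argmax()].backward()
    scores = (emb.grad * emb).norm(dim=-1)  # one attribution score per token
    top = scores.topk(min(k, token_ids.numel())).indices
    return token_ids[top].tolist()

def build_generator_prompt(key_token_ids, vocab):
    """Stage 2: condition a generative LM on the stage-1 key tokens. Only the
    conditioning prefix is built here; in the full framework this prefix would
    drive a fine-tuned generator that writes a fluent, natural-looking input
    containing those tokens."""
    keywords = ", ".join(vocab[i] for i in key_token_ids)
    return f"Keywords: {keywords}\nText:"

if __name__ == "__main__":
    vocab = [f"tok{i}" for i in range(1000)]
    model = TinyTextClassifier(vocab_size=len(vocab))
    token_ids = torch.randint(0, len(vocab), (12,))
    key_ids = white_box_key_tokens(model, token_ids, k=3)
    print(build_generator_prompt(key_ids, vocab))

In a black-box setting, the gradient step above would be replaced by an attribution method that only queries the classifier's predictions (e.g. sampling-based feature attribution), which is what the level-of-access distinction in the abstract refers to.
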
Related papers
- Counterfactual Generation from Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions.
We propose a framework for generating true string counterfactuals.
Our experiments demonstrate that the approach produces meaningful counterfactuals.
arXiv Detail & Related papers (2024-11-11T17:57:30Z) - SA-Attack: Improving Adversarial Transferability of Vision-Language
Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z) - Rethinking Model Ensemble in Transfer-based Adversarial Attacks [46.82830479910875]
An effective strategy to improve the transferability is attacking an ensemble of models.
Previous works simply average the outputs of different models.
We propose a Common Weakness Attack (CWA) to generate more transferable adversarial examples.
arXiv Detail & Related papers (2023-03-16T06:37:16Z) - On the Transferability of Adversarial Attacksagainst Neural Text
Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z) - Differentiable Language Model Adversarial Attacks on Categorical
Sequence Classifiers [0.0]
An adversarial attack paradigm explores various scenarios for the vulnerability of deep learning models.
We fine-tune a language model to serve as a generator of adversarial examples.
Our model works for diverse datasets on bank transactions, electronic health records, and NLP datasets.
arXiv Detail & Related papers (2020-06-19T11:25:36Z) - Learning What Makes a Difference from Counterfactual Examples and
Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z) - Gradient-based adversarial attacks on categorical sequence models via
traversing an embedded world [11.711134497239332]
We consider adversarial attacks on deep learning models with categorical sequences.
We handle the challenges of attacking such categorical sequences using two black-box adversarial attacks.
Results for money transactions, medical fraud, and NLP datasets suggest that proposed methods generate reasonable adversarial sequences.
arXiv Detail & Related papers (2020-03-09T14:31:36Z) - Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z) - Generating Natural Adversarial Hyperspectral examples with a modified
Wasserstein GAN [0.0]
We present a new method that is able to generate natural adversarial examples from the true data, following the second paradigm.
We provide a proof of concept of our method by generating adversarial hyperspectral signatures on a remote sensing dataset.
arXiv Detail & Related papers (2020-01-27T07:32:46Z)