NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as
Artificial Adversaries?
- URL: http://arxiv.org/abs/2211.04364v1
- Date: Tue, 8 Nov 2022 16:37:34 GMT
- Title: NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as
Artificial Adversaries?
- Authors: Saadia Gabriel, Hamid Palangi, Yejin Choi
- Abstract summary: We introduce a two-stage adversarial example generation framework (NaturalAdversaries) for natural language understanding tasks.
It is adaptable to both black-box and white-box adversarial attacks based on the level of access to the model parameters.
Our results indicate these adversaries generalize across domains, and offer insights for future research on improving robustness of neural text classification models.
- Score: 61.58261351116679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While a substantial body of prior work has explored adversarial example
generation for natural language understanding tasks, these examples are often
unrealistic and diverge from the real-world data distributions. In this work,
we introduce a two-stage adversarial example generation framework
(NaturalAdversaries) for designing adversaries that are effective at fooling a
given classifier and that demonstrate natural-looking failure cases that could
plausibly occur during in-the-wild deployment of the models.
In the first stage, a token attribution method is used to summarize a given
classifier's behaviour as a function of the key tokens in the input. In the
second stage, a generative model is conditioned on the key tokens identified in
the first stage. NaturalAdversaries is adaptable to both black-box and white-box
adversarial attacks based on the level of access to the model parameters. Our
results indicate these adversaries generalize across domains, and offer
insights for future research on improving robustness of neural text
classification models.
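To make the two-stage recipe concrete, the following is a minimal Python sketch (not the authors' released implementation): stage one ranks input tokens with a simple occlusion attribution, standing in for the token attribution method described above in the black-box setting, and stage two prompts a generative language model with those key tokens to produce fluent candidate inputs that can then be filtered for label flips. The specific models, the occlusion heuristic, and the keyword-prompt format are illustrative assumptions.
```python
# Minimal sketch of the two-stage idea described in the abstract.
# Model names, the occlusion-style attribution, and the prompt format
# are assumptions for illustration, not the paper's implementation.
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")
gen = pipeline("text-generation", model="gpt2")

def key_tokens(text, k=3):
    """Stage 1 (black-box flavour): score each word by how much removing it
    lowers the classifier's confidence in its original prediction."""
    words = text.split()
    base = clf(text)[0]                      # original label and probability
    drops = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        out = clf(reduced)[0]
        # probability of the *original* label (binary sentiment head assumed)
        p = out["score"] if out["label"] == base["label"] else 1.0 - out["score"]
        drops.append((base["score"] - p, words[i]))
    return [w for _, w in sorted(drops, reverse=True)[:k]]

def natural_candidates(tokens, n=5):
    """Stage 2: condition a generative LM on the key tokens so that candidate
    adversarial inputs stay fluent and close to the data distribution."""
    prompt = "Keywords: " + ", ".join(tokens) + "\nSentence:"
    outs = gen(prompt, max_new_tokens=30, do_sample=True, num_return_sequences=n)
    return [o["generated_text"].split("Sentence:")[-1].strip() for o in outs]

text = "The movie was a complete waste of time."
for cand in natural_candidates(key_tokens(text)):
    print(clf(cand)[0], "<-", cand)   # keep candidates whose label flips
```
In the white-box setting mentioned in the abstract, the occlusion step would be replaced by an attribution that uses access to the model parameters, such as a gradient-based saliency over the input tokens.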
Related papers
- SA-Attack: Improving Adversarial Transferability of Vision-Language
Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Rethinking Model Ensemble in Transfer-based Adversarial Attacks [46.82830479910875]
An effective strategy to improve the transferability is attacking an ensemble of models.
Previous works simply average the outputs of different models.
We propose a Common Weakness Attack (CWA) to generate more transferable adversarial examples.
arXiv Detail & Related papers (2023-03-16T06:37:16Z)
- On the Transferability of Adversarial Attacks against Neural Text
Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Differentiable Language Model Adversarial Attacks on Categorical
Sequence Classifiers [0.0]
An adversarial attack paradigm explores various scenarios for the vulnerability of deep learning models.
We use a fine-tuning of a language model for adversarial attacks as a generator of adversarial examples.
Our model works for diverse datasets on bank transactions, electronic health records, and NLP datasets.
arXiv Detail & Related papers (2020-06-19T11:25:36Z)
- Learning What Makes a Difference from Counterfactual Examples and
Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
- Gradient-based adversarial attacks on categorical sequence models via
traversing an embedded world [11.711134497239332]
We consider adversarial attacks on deep learning models with categorical sequences.
We handle these challenges using two black-box adversarial attacks.
Results for money transactions, medical fraud, and NLP datasets suggest that proposed methods generate reasonable adversarial sequences.
arXiv Detail & Related papers (2020-03-09T14:31:36Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
- Generating Natural Adversarial Hyperspectral examples with a modified
Wasserstein GAN [0.0]
We present a new method that generates natural adversarial examples from the true data, following the second paradigm.
We provide a proof of concept of our method by generating adversarial hyperspectral signatures on a remote sensing dataset.
arXiv Detail & Related papers (2020-01-27T07:32:46Z)