MINIMAL: Mining Models for Data Free Universal Adversarial Triggers
- URL: http://arxiv.org/abs/2109.12406v1
- Date: Sat, 25 Sep 2021 17:24:48 GMT
- Title: MINIMAL: Mining Models for Data Free Universal Adversarial Triggers
- Authors: Swapnil Parekh, Yaman Kumar Singla, Somesh Singh, Changyou Chen,
Balaji Krishnamurthy, and Rajiv Ratn Shah
- Abstract summary: We present a novel data-free approach, MINIMAL, to mine input-agnostic adversarial triggers from NLP models.
We reduce the accuracy of Stanford Sentiment Treebank's positive class from 93.6% to 9.6%.
For the Stanford Natural Language Inference (SNLI), our single-word trigger reduces the accuracy of the entailment class from 90.95% to less than 0.6%.
- Score: 57.14359126600029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is well known that natural language models are vulnerable to adversarial
attacks, which are mostly input-specific in nature. Recently, it has been shown
that there also exist input-agnostic attacks in NLP models, called universal
adversarial triggers. However, existing methods to craft universal triggers are
data intensive. They require large amounts of data samples to generate
adversarial triggers, which are typically inaccessible by attackers. For
instance, previous works take 3000 data samples per class for the SNLI dataset
to generate adversarial triggers. In this paper, we present a novel data-free
approach, MINIMAL, to mine input-agnostic adversarial triggers from models.
Using the triggers produced with our data-free algorithm, we reduce the
accuracy of Stanford Sentiment Treebank's positive class from 93.6% to 9.6%.
Similarly, for the Stanford Natural Language Inference (SNLI), our single-word
trigger reduces the accuracy of the entailment class from 90.95% to less than
0.6%. Despite being completely data-free, we achieve accuracy drops
equivalent to those of data-dependent methods.
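The abstract does not spell out MINIMAL's mining procedure, so the sketch below only illustrates the general idea that universal-trigger work builds on: gradient-guided (HotFlip-style) search for a short trigger, run here on random synthetic token sequences in place of the real data a data-free attacker lacks. The toy classifier, vocabulary size, trigger length, and every other name in the code are hypothetical illustrations, not details taken from the paper.

```python
# Hypothetical sketch of gradient-guided universal trigger mining.
# NOTE: this is NOT the MINIMAL algorithm itself (the abstract does not give it);
# it only shows the HotFlip-style token-flip search that universal-trigger
# attacks build on, with random token ids standing in for real training data.
import torch
import torch.nn as nn

VOCAB, EMB_DIM, TRIGGER_LEN, SEQ_LEN, BATCH = 1000, 32, 3, 12, 16

class ToyClassifier(nn.Module):
    """Stand-in victim model: embedding -> mean pool -> linear head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB_DIM)
        self.head = nn.Linear(EMB_DIM, 2)

    def forward(self, embedded):  # takes embeddings so we can differentiate w.r.t. them
        return self.head(embedded.mean(dim=1))

model = ToyClassifier().eval()
loss_fn = nn.CrossEntropyLoss()

# Data-free stand-in for real inputs: random token sequences labelled with the
# attacked class (here: class 0, e.g. "positive").
synthetic = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
attacked_class = torch.zeros(BATCH, dtype=torch.long)

trigger = torch.randint(0, VOCAB, (TRIGGER_LEN,))    # current trigger token ids
for _ in range(20):                                   # a few rounds of token flips
    batch = torch.cat([trigger.repeat(BATCH, 1), synthetic], dim=1)
    embedded = model.emb(batch).detach().requires_grad_(True)
    loss = loss_fn(model(embedded), attacked_class)
    loss.backward()
    # Average the gradient at the trigger positions, then score every vocabulary
    # token by the first-order increase in loss if it replaced that position.
    grad = embedded.grad[:, :TRIGGER_LEN, :].mean(dim=0)   # (TRIGGER_LEN, EMB_DIM)
    scores = grad @ model.emb.weight.detach().T            # (TRIGGER_LEN, VOCAB)
    trigger = scores.argmax(dim=1)       # pick the tokens that most increase the loss

print("candidate trigger token ids:", trigger.tolist())
```

In a real attack the search would run against the victim model's own embedding matrix until the trigger stops changing, and the recovered tokens would then be prepended to any input of the attacked class.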
Related papers
- Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers [11.64617586381446]
We show how a new UAT generation method, called IndisUAT, can be used to craft adversarial examples.
The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models.
IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively.
arXiv Detail & Related papers (2024-09-05T02:19:34Z) - ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned
Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z) - Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z) - Poisoning Language Models During Instruction Tuning [111.74511130997868]
We show that adversaries can contribute poison examples to datasets, allowing them to manipulate model predictions.
For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input.
arXiv Detail & Related papers (2023-05-01T16:57:33Z) - Targeted Attack on GPT-Neo for the SATML Language Model Data Extraction
Challenge [4.438873396405334]
We apply a targeted data extraction attack to the SATML2023 Language Model Training Data Extraction Challenge.
We maximise the recall of the model and are able to extract the suffix for 69% of the samples.
Our approach reaches a score of 0.405 recall at a 10% false positive rate, which is an improvement of 34% over the baseline of 0.301.
arXiv Detail & Related papers (2023-02-13T18:00:44Z) - Semantic Preserving Adversarial Attack Generation with Autoencoder and
Genetic Algorithm [29.613411948228563]
Small amounts of noise can fool state-of-the-art models into making incorrect predictions.
We propose a black-box attack, which modifies latent features of data extracted by an autoencoder.
We trained autoencoders on MNIST and CIFAR-10 datasets and found optimal adversarial perturbations using a genetic algorithm.
arXiv Detail & Related papers (2022-08-25T17:27:26Z) - Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled ones.
We show that NPC-LV outperforms supervised methods on image classification across all three datasets in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z) - Label-only Model Inversion Attack: The Attack that Requires the Least
Information [14.061083728194378]
In a model inversion attack, an adversary attempts to reconstruct the data records used to train a target model, using only the model's output.
We have found a model inversion method that can reconstruct the input data records based only on the output labels.
arXiv Detail & Related papers (2022-03-13T03:03:49Z) - Learnable Boundary Guided Adversarial Training [66.57846365425598]
We use the logits from one clean model to guide the learning of another, robust model.
We achieve new state-of-the-art robustness on CIFAR-100 without additional real or synthetic data.
arXiv Detail & Related papers (2020-11-23T01:36:05Z) - BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than those on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.