Model-tuning Via Prompts Makes NLP Models Adversarially Robust
- URL: http://arxiv.org/abs/2303.07320v2
- Date: Wed, 6 Dec 2023 00:48:53 GMT
- Title: Model-tuning Via Prompts Makes NLP Models Adversarially Robust
- Authors: Mrigank Raman, Pratyush Maini, J. Zico Kolter, Zachary C. Lipton,
Danish Pruthi
- Abstract summary: We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
- Score: 97.02353907677703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, NLP practitioners have converged on the following practice:
(i) import an off-the-shelf pretrained (masked) language model; (ii) append a
multilayer perceptron atop the CLS token's hidden representation (with randomly
initialized weights); and (iii) fine-tune the entire model on a downstream task
(MLP-FT). This procedure has produced massive gains on standard NLP benchmarks,
but these models remain brittle, even to mild adversarial perturbations. In
this work, we demonstrate surprising gains in adversarial robustness enjoyed by
Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream
tasks. Rather than appending an MLP head to make output predictions, MVP appends
a prompt template to the input and makes predictions via text
infilling/completion. Across 5 NLP datasets, 4 adversarial attacks, and 3
different models, MVP improves performance against adversarial substitutions by
an average of 8% over standard methods and even outperforms adversarial
training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial
training, we achieve further improvements in adversarial robustness while
maintaining performance on unperturbed examples. Finally, we conduct ablations
to investigate the mechanism underlying these gains. Notably, we find that the
main causes of vulnerability of MLP-FT can be attributed to the misalignment
between pre-training and fine-tuning tasks, and the randomly initialized MLP
parameters.
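As a rough illustration of the two adaptation strategies described above, the snippet below contrasts MLP-FT, which scores classes with a randomly initialized head on the CLS token's representation, with an MVP-style setup that appends a prompt ending in a mask token and compares the masked-LM logits of candidate label words. This is a minimal sketch, not the authors' released code; the RoBERTa checkpoint, the prompt template, and the label words "terrible"/"great" are assumptions made only for demonstration.

```python
# Minimal sketch: MLP-FT vs. MVP-style prompt-based prediction.
# Assumptions (not from the paper): roberta-base, binary sentiment,
# the template "It was <mask>." and the verbalizers "terrible"/"great".
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForMaskedLM,
)

text = "The movie was surprisingly good."
tok = AutoTokenizer.from_pretrained("roberta-base")

# --- MLP-FT: randomly initialized classification head on the CLS token ------
clf = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
with torch.no_grad():
    cls_logits = clf(**tok(text, return_tensors="pt")).logits  # (1, 2), untrained head

# --- MVP-style: prompt template + prediction via mask infilling -------------
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")
prompt = tok(f"{text} It was {tok.mask_token}.", return_tensors="pt")
mask_pos = (prompt["input_ids"] == tok.mask_token_id).nonzero(as_tuple=True)[1]
with torch.no_grad():
    vocab_logits = mlm(**prompt).logits[0, mask_pos]  # logits over the vocabulary
# Score each class by the logit of its label word at the masked position.
label_ids = [tok(" " + w, add_special_tokens=False)["input_ids"][0]
             for w in ("terrible", "great")]
mvp_scores = vocab_logits[0, label_ids]
print("MLP-FT logits:", cls_logits, "| MVP-style scores:", mvp_scores)
```

Because the masked-LM head and the label-word embeddings are reused from pre-training, no randomly initialized parameters sit between the input and the prediction, which is exactly the alignment the abstract credits for the robustness gains.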
Related papers
- PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning [28.845915332201592]
Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performance.
The soaring cost of training PLMs and their remarkable generalizability have jointly driven the adoption of few-shot fine-tuning and prompting.
Yet, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented.
We propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings.
arXiv Detail & Related papers (2024-06-06T20:06:42Z)
- Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which would require training a separate denoising model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
- Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness [52.9493817508055]
We propose Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT) to enhance the model's zero-shot adversarial robustness.
Our approach consistently improves clean accuracy by an average of 8.72%.
arXiv Detail & Related papers (2024-01-09T04:33:03Z)
- Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- MockingBERT: A Method for Retroactively Adding Resilience to NLP Models [4.584774276587428]
We propose a novel method for retroactively adding resilience against misspellings to transformer-based NLP models.
This can be achieved without the need for re-training of the original NLP model.
We also propose a new efficient approximate method of generating adversarial misspellings.
arXiv Detail & Related papers (2022-08-21T16:02:01Z)
- giMLPs: Gate with Inhibition Mechanism in MLPs [13.288519661160898]
Gate with inhibition (giMLP) can produce equal performance on the ImageNet classification task.
Gate With Inhibition can achieve appealing results on most NLU tasks without any additional pretraining.
Experiments on ImageNet and twelve language downstream tasks demonstrate the effectiveness of Gate With Inhibition.
arXiv Detail & Related papers (2022-08-01T15:23:51Z)
- A Prompting-based Approach for Adversarial Example Generation and Robustness Enhancement [18.532308729844598]
We propose a novel prompt-based adversarial attack to compromise NLP models.
We generate adversarial examples via mask-and-filling, guided by a malicious purpose.
Because our training method does not actually generate adversarial samples, it can be applied efficiently to large-scale training sets.
arXiv Detail & Related papers (2022-03-21T03:21:32Z)
- Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defend against substitution-based attacks.
DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and it augments the training data with these virtual sentences (a minimal illustrative sketch follows at the end of this list).
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
arXiv Detail & Related papers (2020-06-20T18:01:16Z)
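To make the DNE sampling step above concrete, here is a minimal, illustrative sketch; the toy vocabulary, embedding matrix, and synonym table are hypothetical stand-ins rather than the authors' implementation:

```python
# Illustrative sketch of DNE-style virtual sentences: each word embedding is
# replaced by a Dirichlet-weighted convex combination of the word and its
# synonyms. Vocabulary, embeddings, and synonym table below are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "movie": 1, "was": 2, "good": 3, "great": 4, "fine": 5}
emb = rng.normal(size=(len(vocab), 8))      # pretend pre-trained embeddings
synonyms = {"good": ["great", "fine"]}      # pretend synonym sets

def dne_embed(word: str, alpha: float = 1.0) -> np.ndarray:
    """Sample a point inside the convex hull of a word and its synonyms."""
    candidates = [word] + synonyms.get(word, [])
    vectors = emb[[vocab[w] for w in candidates]]              # (k, dim)
    weights = rng.dirichlet(alpha * np.ones(len(candidates)))  # sum to 1
    return weights @ vectors                                   # convex combination

# A "virtual sentence" built this way augments the training data, so the model
# is trained to be stable under synonym substitutions.
virtual_sentence = np.stack([dne_embed(w) for w in ["the", "movie", "was", "good"]])
print(virtual_sentence.shape)  # (4, 8)
```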