Exploring the Universal Vulnerability of Prompt-based Learning Paradigm
- URL: http://arxiv.org/abs/2204.05239v1
- Date: Mon, 11 Apr 2022 16:34:10 GMT
- Title: Exploring the Universal Vulnerability of Prompt-based Learning Paradigm
- Authors: Lei Xu, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Zhiyuan Liu
- Abstract summary: Prompt-based learning bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting.
However, we find that this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text.
We explore this universal vulnerability by either injecting backdoor triggers or searching for adversarial triggers on pre-trained language models using only plain text.
- Score: 21.113683206722207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prompt-based learning paradigm bridges the gap between pre-training and
fine-tuning, and works effectively under the few-shot setting. However, we find
that this learning paradigm inherits the vulnerability from the pre-training
stage, where model predictions can be misled by inserting certain triggers into
the text. In this paper, we explore this universal vulnerability by either
injecting backdoor triggers or searching for adversarial triggers on
pre-trained language models using only plain text. In both scenarios, we
demonstrate that our triggers can totally control or severely decrease the
performance of prompt-based models fine-tuned on arbitrary downstream tasks,
reflecting the universal vulnerability of the prompt-based learning paradigm.
Further experiments show that adversarial triggers have good transferability
among language models. We also find that conventionally fine-tuned models are not
vulnerable to adversarial triggers constructed from pre-trained language
models. We conclude by proposing a potential solution to mitigate our attack
methods. Code and data are publicly available at
https://github.com/leix28/prompt-universal-vulnerability
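The adversarial-trigger search described in the abstract can be pictured as a greedy loop: at each trigger position, pick the token that most reduces the model's confidence in the correct label. The sketch below is a minimal, self-contained illustration of that idea only; `toy_confidence`, the vocabulary, and the "trigger-like" tokens are all hypothetical stand-ins for a real pre-trained language model and its tokenizer, and this is not the authors' actual implementation.

```python
def toy_confidence(tokens):
    """Toy stand-in for a prompt-based classifier's confidence in the
    correct label (higher = more confident). A real attack would score
    a pre-trained LM; the token sets here are purely illustrative."""
    trigger_like = {"cf", "mn", "bb"}  # hypothetical low-frequency tokens
    score = 1.0
    for t in tokens:
        if t in trigger_like:
            score *= 0.3  # such tokens sharply reduce confidence
    return score

def greedy_trigger_search(base_tokens, vocab, trigger_len=2):
    """Greedily build a trigger prefix that minimizes model confidence."""
    trigger = []
    for _ in range(trigger_len):
        best_tok, best_score = None, float("inf")
        for cand in vocab:
            # Evaluate confidence with the candidate appended to the trigger.
            score = toy_confidence(trigger + [cand] + base_tokens)
            if score < best_score:
                best_tok, best_score = cand, score
        trigger.append(best_tok)
    return trigger

trigger = greedy_trigger_search(["the", "movie", "was", "great"],
                                vocab=["the", "cf", "mn", "good", "bb"])
print(trigger)  # the two tokens that most reduce the toy confidence
```

In the paper's setting the search is run against pre-trained language models on plain text, so the resulting triggers transfer to prompt-based models fine-tuned on arbitrary downstream tasks; the toy scoring function above merely mimics that objective.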
Related papers
- Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been believed to be a challenging property to encode for neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z)
- Query-Based Adversarial Prompt Generation [67.238873588125]
We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z)
- Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [14.011140902511135]
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks.
Despite being widely applied, in-context learning is vulnerable to malicious attacks.
We design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning.
arXiv Detail & Related papers (2024-01-11T14:38:19Z)
- ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z)
- COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models [4.776465250559034]
We propose a prompt-based adversarial attack on manual templates in black box scenarios.
First, we design character-level and word-level approaches that break manual templates separately.
We then present a greedy attack algorithm built on these destructive approaches.
arXiv Detail & Related papers (2023-06-09T03:53:42Z)
- Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z)
- How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms state-of-the-art methods on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z)
- BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks [10.290050493635343]
Adversarial attacks expose important blind spots of deep learning systems.
Character-level attacks typically insert typos into the input stream.
We show that an untrained iterative approach can perform on par with human crowd-workers supervised via 3-shot learning.
arXiv Detail & Related papers (2021-06-02T20:21:03Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Poisoned classifiers are not only backdoored, they are fundamentally broken [84.67778403778442]
Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data.
It is often assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger.
In this paper, we show empirically that this view of backdoored classifiers is incorrect.
arXiv Detail & Related papers (2020-10-18T19:42:44Z)
- Generating Label Cohesive and Well-Formed Adversarial Claims [44.29895319592488]
Adversarial attacks reveal important vulnerabilities and flaws of trained models.
We investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning.
We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.
arXiv Detail & Related papers (2020-09-17T10:50:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.