PromptAttack: Prompt-based Attack for Language Models via Gradient
Search
- URL: http://arxiv.org/abs/2209.01882v1
- Date: Mon, 5 Sep 2022 10:28:20 GMT
- Title: PromptAttack: Prompt-based Attack for Language Models via Gradient
Search
- Authors: Yundi Shi, Piji Li, Changchun Yin, Zhaoyang Han, Lu Zhou, Zhe Liu
- Abstract summary: We observe that the prompt learning methods are vulnerable and can easily be attacked by some illegally constructed prompts.
In this paper, we propose a malicious prompt template construction method (textbfPromptAttack) to probe the security performance of PLMs.
- Score: 24.42194796252163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the pre-trained language models (PLMs) continue to grow, so do the
hardware and data requirements for fine-tuning PLMs. Therefore, the researchers
have come up with a lighter method called \textit{Prompt Learning}. However,
during the investigations, we observe that the prompt learning methods are
vulnerable and can easily be attacked by some illegally constructed prompts,
resulting in classification errors, and serious security problems for PLMs.
Most of the current research ignores the security issue of prompt-based
methods. Therefore, in this paper, we propose a malicious prompt template
construction method (\textbf{PromptAttack}) to probe the security performance
of PLMs. Several unfriendly template construction approaches are investigated
to guide the model to misclassify the task. Extensive experiments on three
datasets and three PLMs prove the effectiveness of our proposed approach
PromptAttack. We also conduct experiments to verify that our method is
applicable in few-shot scenarios.
Related papers
- $\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models [13.416624729344477]
Prompt-based learning is a new language model training paradigm that adapts the Pre-trained Language Models (PLMs) to downstream tasks.
In this work, we develop $textitLinkPrompt$, an adversarial attack algorithm to generate adversarial triggers.
arXiv Detail & Related papers (2024-03-25T05:27:35Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers [74.7446827091938]
We introduce an automatic prompt textbfDecomposition and textbfReconstruction framework for jailbreak textbfAttack (DrAttack)
DrAttack includes three key components: (a) Decomposition' of the original prompt into sub-prompts, (b) Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while
arXiv Detail & Related papers (2024-02-25T17:43:29Z) - COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in
Language Models [4.776465250559034]
We propose a prompt-based adversarial attack on manual templates in black box scenarios.
First of all, we design character-level and word-level approaches to break manual templates separately.
And we present a greedy algorithm for the attack based on the above destructive approaches.
arXiv Detail & Related papers (2023-06-09T03:53:42Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z) - Ignore Previous Prompt: Attack Techniques For Language Models [0.0]
We propose PromptInject, a framework for mask-based adversarial prompt composition.
We show how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs.
arXiv Detail & Related papers (2022-11-17T13:43:20Z) - Instance-wise Prompt Tuning for Pretrained Language Models [72.74916121511662]
Instance-wise Prompt Tuning (IPT) is the first prompt learning paradigm that injects knowledge from the input data instances to the prompts.
IPT significantly outperforms task-based prompt learning methods, and achieves comparable performance to conventional finetuning with only 0.5% - 1.5% of tuned parameters.
arXiv Detail & Related papers (2022-06-04T10:08:50Z) - Prompt Tuning for Discriminative Pre-trained Language Models [96.04765512463415]
Recent works have shown promising results of prompt tuning in stimulating pre-trained language models (PLMs) for natural language processing (NLP) tasks.
It is still unknown whether and how discriminative PLMs, e.g., ELECTRA, can be effectively prompt-tuned.
We present DPT, the first prompt tuning framework for discriminative PLMs, which reformulates NLP tasks into a discriminative language modeling problem.
arXiv Detail & Related papers (2022-05-23T10:11:50Z) - CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented
Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS) that exploits PLMs with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD.
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z) - Prompt-Learning for Fine-Grained Entity Typing [40.983849729537795]
We investigate the application of prompt-learning on fine-grained entity typing in fully supervised, few-shot and zero-shot scenarios.
We propose a self-supervised strategy that carries out distribution-level optimization in prompt-learning to automatically summarize the information of entity types.
arXiv Detail & Related papers (2021-08-24T09:39:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.