Ignore Previous Prompt: Attack Techniques For Language Models
- URL: http://arxiv.org/abs/2211.09527v1
- Date: Thu, 17 Nov 2022 13:43:20 GMT
- Title: Ignore Previous Prompt: Attack Techniques For Language Models
- Authors: F\'abio Perez and Ian Ribeiro
- Abstract summary: We propose PromptInject, a framework for mask-based adversarial prompt composition.
We show how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based large language models (LLMs) provide a powerful foundation
for natural language tasks in large-scale customer-facing applications.
However, studies that explore their vulnerabilities emerging from malicious
user interaction are scarce. By proposing PromptInject, a prosaic alignment
framework for mask-based iterative adversarial prompt composition, we examine
how GPT-3, the most widely deployed language model in production, can be easily
misaligned by simple handcrafted inputs. In particular, we investigate two
types of attacks -- goal hijacking and prompt leaking -- and demonstrate that
even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit
GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject
is available at https://github.com/agencyenterprise/PromptInject.
Related papers
- $\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models [13.416624729344477]
Prompt-based learning is a new language model training paradigm that adapts the Pre-trained Language Models (PLMs) to downstream tasks.
In this work, we develop $textitLinkPrompt$, an adversarial attack algorithm to generate adversarial triggers.
arXiv Detail & Related papers (2024-03-25T05:27:35Z) - Defending Against Indirect Prompt Injection Attacks With Spotlighting [11.127479817618692]
In common applications, multiple inputs can be processed by concatenating them together into a single stream of text.
Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands.
We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input.
arXiv Detail & Related papers (2024-03-20T15:26:23Z) - DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM
Jailbreakers [80.18953043605696]
We introduce an automatic prompt textbfDecomposition and textbfReconstruction framework for jailbreak textbfAttack (DrAttack)
DrAttack includes three key components: (a) Decomposition' of the original prompt into sub-prompts, (b) Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while
arXiv Detail & Related papers (2024-02-25T17:43:29Z) - Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs)
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel textscSelf-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z) - Universal and Transferable Adversarial Attacks on Aligned Language
Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z) - Effective Prompt Extraction from Language Models [70.00099540536382]
We present a framework for measuring the effectiveness of prompt extraction attacks.
In experiments with 3 different sources of prompts and 11 underlying large language models, we find that simple text-based attacks can in fact reveal prompts with high probability.
Our framework determines with high precision whether an extracted prompt is the actual secret prompt, rather than a model hallucination.
arXiv Detail & Related papers (2023-07-13T16:15:08Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z) - PromptAttack: Prompt-based Attack for Language Models via Gradient
Search [24.42194796252163]
We observe that the prompt learning methods are vulnerable and can easily be attacked by some illegally constructed prompts.
In this paper, we propose a malicious prompt template construction method (textbfPromptAttack) to probe the security performance of PLMs.
arXiv Detail & Related papers (2022-09-05T10:28:20Z) - Exploring the Universal Vulnerability of Prompt-based Learning Paradigm [21.113683206722207]
We find that prompt-based learning bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting.
However, we find that this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text.
We explore this universal vulnerability by either injecting backdoor triggers or searching for adversarial triggers on pre-trained language models using only plain text.
arXiv Detail & Related papers (2022-04-11T16:34:10Z) - Trojaning Language Models for Fun and Profit [53.45727748224679]
TROJAN-LM is a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction.
By empirically studying three state-of-the-art LMs in a range of security-critical NLP tasks, we demonstrate that TROJAN-LM possesses the following properties.
arXiv Detail & Related papers (2020-08-01T18:22:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.