Formalizing and Benchmarking Prompt Injection Attacks and Defenses
- URL: http://arxiv.org/abs/2310.12815v3
- Date: Sat, 1 Jun 2024 21:21:07 GMT
- Title: Formalizing and Benchmarking Prompt Injection Attacks and Defenses
- Authors: Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong,
- Abstract summary: We propose a framework to formalize prompt injection attacks.
Based on our framework, we design a new attack by combining existing ones.
Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses.
- Score: 59.57908526441172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A prompt injection attack aims to inject malicious instruction/data into the input of an LLM-Integrated Application such that it produces results as an attacker desires. Existing works are limited to case studies. As a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. We aim to bridge the gap in this work. In particular, we propose a framework to formalize prompt injection attacks. Existing attacks are special cases in our framework. Moreover, based on our framework, we design a new attack by combining existing ones. Using our framework, we conduct a systematic evaluation on 5 prompt injection attacks and 10 defenses with 10 LLMs and 7 tasks. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses. To facilitate research on this topic, we make our platform public at https://github.com/liu00222/Open-Prompt-Injection.
Related papers
- Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
We explore converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - Diffusion Denoising as a Certified Defense against Clean-label Poisoning [56.04951180983087]
We show how an off-the-shelf diffusion model can sanitize the tampered training data.
We extensively test our defense against seven clean-label poisoning attacks and reduce their attack success to 0-16% with only a negligible drop in the test time accuracy.
arXiv Detail & Related papers (2024-03-18T17:17:07Z) - Automatic and Universal Prompt Injection Attacks against Large Language
Models [38.694912482525446]
Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions.
These attacks manipulate applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests.
We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data.
arXiv Detail & Related papers (2024-03-07T23:46:20Z) - An Early Categorization of Prompt Injection Attacks on Large Language
Models [0.8875650122536799]
Large language models and AI chatbots have been at the forefront of democratizing artificial intelligence.
We are witnessing a cat-and-mouse game where users attempt to misuse the models with a novel attack called prompt injections.
In this paper, we provide an overview of these emergent threats and present a categorization of prompt injections.
arXiv Detail & Related papers (2024-01-31T19:52:00Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on
Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks.
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks.
We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z) - Prompt Injection attack against LLM-integrated Applications [37.86878788874201]
This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications.
We formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks.
We deploy HouYi on 36 actual LLM-integrated applications and discern 31 applications susceptible to prompt injection.
arXiv Detail & Related papers (2023-06-08T18:43:11Z) - Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in
Language Models [41.1058288041033]
We propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt.
Our method does not require external triggers and ensures correct labeling of poisoned samples, improving the stealthy nature of the backdoor attack.
arXiv Detail & Related papers (2023-05-02T06:19:36Z) - A Targeted Attack on Black-Box Neural Machine Translation with Parallel
Data Poisoning [60.826628282900955]
We show that targeted attacks on black-box NMT systems are feasible, based on poisoning a small fraction of their parallel training data.
We show that this attack can be realised practically via targeted corruption of web documents crawled to form the system's training data.
Our results are alarming: even on the state-of-the-art systems trained with massive parallel data, the attacks are still successful (over 50% success rate) under surprisingly low poisoning budgets.
arXiv Detail & Related papers (2020-11-02T01:52:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.