Related papers: Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

URL: http://arxiv.org/abs/2310.12815v3
Date: Sat, 1 Jun 2024 21:21:07 GMT
Title: Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Authors: Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong,
Abstract summary: We propose a framework to formalize prompt injection attacks. Based on our framework, we design a new attack by combining existing ones. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses.
Score: 59.57908526441172
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A prompt injection attack aims to inject malicious instruction/data into the input of an LLM-Integrated Application such that it produces results as an attacker desires. Existing works are limited to case studies. As a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. We aim to bridge the gap in this work. In particular, we propose a framework to formalize prompt injection attacks. Existing attacks are special cases in our framework. Moreover, based on our framework, we design a new attack by combining existing ones. Using our framework, we conduct a systematic evaluation on 5 prompt injection attacks and 10 defenses with 10 LLMs and 7 tasks. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses. To facilitate research on this topic, we make our platform public at https://github.com/liu00222/Open-Prompt-Injection.

Related papers

TopicAttack: An Indirect Prompt Injection Attack via Topic Transition [71.81906608221038]
Large language models (LLMs) are vulnerable to indirect prompt injection attacks.<n>We propose TopicAttack, which prompts the LLM to generate a fabricated transition prompt that gradually shifts the topic toward the injected instruction.<n>We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
arXiv Detail & Related papers (2025-07-18T06:23:31Z)
May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks [14.307668562901263]
A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data.<n>We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks.
arXiv Detail & Related papers (2025-07-10T04:20:53Z)
To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt [5.8935359767204805]
We propose a novel, lightweight defense mechanism called Polymorphic Prompt Assembling.<n>The approach is based on the insight that prompt injection requires guessing and breaking the structure of the system prompt.<n> PPA prevents attackers from predicting the prompt structure, thereby enhancing security without compromising performance.
arXiv Detail & Related papers (2025-06-06T04:50:57Z)
UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models [30.139590566956077]
Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks. We propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs.
arXiv Detail & Related papers (2025-02-18T18:59:00Z)
MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks. We present MELON, a novel IPI defense. We show that MELON outperforms SOTA defenses in both attack prevention and utility preservation.
arXiv Detail & Related papers (2025-02-07T18:57:49Z)
Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks. As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment [35.62055590612484]
We show that an attacker can boost the success of prompt injection attacks by poisoning the LLM's alignment process. Specifically, we propose PoisonedAlign, a method to strategically create poisoned alignment samples.
arXiv Detail & Related papers (2024-10-18T18:52:16Z)
Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks. Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks. Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
Diffusion Denoising as a Certified Defense against Clean-label Poisoning [56.04951180983087]
We show how an off-the-shelf diffusion model can sanitize the tampered training data. We extensively test our defense against seven clean-label poisoning attacks and reduce their attack success to 0-16% with only a negligible drop in the test time accuracy.
arXiv Detail & Related papers (2024-03-18T17:17:07Z)
Automatic and Universal Prompt Injection Attacks against Large Language Models [38.694912482525446]
Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. These attacks manipulate applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data.
arXiv Detail & Related papers (2024-03-07T23:46:20Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks. We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks. We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z)
Prompt Injection attack against LLM-integrated Applications [37.86878788874201]
This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications. We formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. We deploy HouYi on 36 actual LLM-integrated applications and discern 31 applications susceptible to prompt injection.
arXiv Detail & Related papers (2023-06-08T18:43:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.