Related papers: Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

URL: http://arxiv.org/abs/2312.14197v4
Date: Mon, 27 Jan 2025 08:51:16 GMT
Title: Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models
Authors: Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu,
Abstract summary: We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.<n>Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.<n>We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
Score: 79.0183835295533
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The integration of large language models with external content has enabled applications such as Microsoft Copilot but also introduced vulnerabilities to indirect prompt injection attacks. In these attacks, malicious instructions embedded within external content can manipulate LLM outputs, causing deviations from user expectations. To address this critical yet under-explored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Using BIPIA, we evaluate existing LLMs and find them universally vulnerable. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. Based on these findings, we propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings. Extensive experiments demonstrate that our black-box defense provides substantial mitigation, while our white-box defense reduces the attack success rate to near-zero levels, all while preserving the output quality of LLMs. We hope this work inspires further research into securing LLM applications and fostering their safe and reliable use.

Related papers

The Vulnerability of LLM Rankers to Prompt Injection Attacks [40.03039307576983]
Large Language Models (LLMs) have emerged as powerful re-rankers.<n>Recent research has showed that simple prompt injections embedded within a candidate document can significantly alter an LLM's ranking decisions.
arXiv Detail & Related papers (2026-02-18T06:19:08Z)
Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack [53.34204977366491]
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities.<n>In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks.<n>Our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts.
arXiv Detail & Related papers (2025-11-01T13:44:42Z)
Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs [7.183743839410718]
Malicious users often exploit adversarial context to deceive Large Language Models (LLMs)<n>In this study, we propose a new defense mechanism called Context Filtering model.<n>Our model demonstrates its ability to reduce the Attack Success Rates of jailbreak attacks by up to 88%.
arXiv Detail & Related papers (2025-08-09T02:37:59Z)
Inducing Vulnerable Code Generation in LLM Coding Assistants [10.067898047221558]
In this paper, we reveal a real-world threat, named HACKODE, where attackers exploit referenced external information to embed attack sequences.<n>We designed a prototype of the attack, which generates effective attack sequences for potential diverse inputs.<n>On a real-world application, HACKODE achieves 75.92% ASR, demonstrating its real-world impact.
arXiv Detail & Related papers (2025-04-22T13:09:20Z)
Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.84977282952602]
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs) In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities.
arXiv Detail & Related papers (2025-02-12T17:19:36Z)
Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z)
Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks. As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks. We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction. We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z)
Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD) PAD is a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques. PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents. These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks. This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z)
Automatic and Universal Prompt Injection Attacks against Large Language Models [38.694912482525446]
Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. These attacks manipulate applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data.
arXiv Detail & Related papers (2024-03-07T23:46:20Z)
Learning to Poison Large Language Models During Instruction Tuning [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently. We propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL)
arXiv Detail & Related papers (2024-02-21T01:30:03Z)
Hijacking Large Language Models via Adversarial In-Context Learning [8.15194326639149]
In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks. Existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. This work introduces a novel transferable attack against ICL to address these issues.
arXiv Detail & Related papers (2023-11-16T15:01:48Z)
Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following. This capability brings with it the risk of prompt injection attacks. We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications. We show how attackers can override original instructions and employed controls using Prompt Injection attacks. We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks [67.86285142381644]
Recent advances in instruction-following large language models amplify the dual-use risks for malicious purposes. Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. We show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams.
arXiv Detail & Related papers (2023-02-11T15:57:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.