Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection
- URL: http://arxiv.org/abs/2308.10819v3
- Date: Sat, 25 Nov 2023 00:25:36 GMT
- Title: Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection
- Authors: Zekun Li and Baolin Peng and Pengcheng He and Xifeng Yan
- Abstract summary: Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
- Score: 70.28425745910711
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated exceptional proficiency in
instruction-following, becoming increasingly crucial across various
applications. However, this capability brings with it the risk of prompt
injection attacks, where attackers inject instructions into LLMs' input to
elicit undesirable actions or content. Understanding the robustness of LLMs
against such attacks is vital for their safe implementation. In this work, we
establish a benchmark to evaluate the robustness of instruction-following LLMs
against prompt injection attacks. Our objective is to determine the extent to
which LLMs can be influenced by injected instructions and their ability to
differentiate between these injected and original target instructions. Through
extensive experiments with leading instruction-following LLMs, we uncover
significant vulnerabilities in their robustness to such attacks. Our results
indicate that some models are overly tuned to follow any embedded instructions
in the prompt, overly focusing on the latter parts of the prompt without fully
grasping the entire context. By contrast, models with a better grasp of the
context and instruction-following capabilities will potentially be more
susceptible to compromise by injected instructions. This underscores the need
to shift the focus from merely enhancing LLMs' instruction-following
capabilities to improving their overall comprehension of prompts and
discernment of instructions that are appropriate to follow. We hope our
in-depth analysis offers insights into the underlying causes of these
vulnerabilities, aiding in the development of future solutions. Code and data
are available at
https://github.com/Leezekun/instruction-following-robustness-eval
Related papers
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Defending Against Indirect Prompt Injection Attacks With Spotlighting [11.127479817618692]
In common applications, multiple inputs can be processed by concatenating them together into a single stream of text.
Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands.
We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input.
arXiv Detail & Related papers (2024-03-20T15:26:23Z) - Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks
Against LLM-Integrated Applications [0.0]
This paper introduces the 'Signed-Prompt' method as a novel solution for prompt injection attacks.
The study involves signing sensitive instructions within command segments by authorized users, enabling the LLM to discern trusted instruction sources.
Experiments demonstrate the effectiveness of the Signed-Prompt method, showing substantial resistance to various types of prompt injection attacks.
arXiv Detail & Related papers (2024-01-15T11:44:18Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on
Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks.
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks.
We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - Backdoor Activation Attack: Attack Large Language Models using
Activation Steering for Safety-Alignment [36.91218391728405]
This paper studies the vulnerability of Large Language Models' safety alignment.
Existing attack methods on LLMs rely on poisoned training data or the injection of malicious prompts.
Inspired by recent success in modifying model behavior through steering vectors without the need for optimization, we draw on its effectiveness in red-teaming LLMs.
Our experiment results show that activation attacks are highly effective and add little or no overhead to attack efficiency.
arXiv Detail & Related papers (2023-11-15T23:07:40Z) - Enhancing Large Language Models Against Inductive Instructions with
Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly.
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.
We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z) - Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard
Security Attacks [67.86285142381644]
Recent advances in instruction-following large language models amplify the dual-use risks for malicious purposes.
Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security.
We show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams.
arXiv Detail & Related papers (2023-02-11T15:57:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.