Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection
- URL: http://arxiv.org/abs/2308.10819v3
- Date: Sat, 25 Nov 2023 00:25:36 GMT
- Title: Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection
- Authors: Zekun Li and Baolin Peng and Pengcheng He and Xifeng Yan
- Abstract summary: Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
- Score: 70.28425745910711
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated exceptional proficiency in
instruction-following, becoming increasingly crucial across various
applications. However, this capability brings with it the risk of prompt
injection attacks, where attackers inject instructions into LLMs' input to
elicit undesirable actions or content. Understanding the robustness of LLMs
against such attacks is vital for their safe implementation. In this work, we
establish a benchmark to evaluate the robustness of instruction-following LLMs
against prompt injection attacks. Our objective is to determine the extent to
which LLMs can be influenced by injected instructions and their ability to
differentiate between these injected and original target instructions. Through
extensive experiments with leading instruction-following LLMs, we uncover
significant vulnerabilities in their robustness to such attacks. Our results
indicate that some models are overly tuned to follow any embedded instructions
in the prompt, overly focusing on the latter parts of the prompt without fully
grasping the entire context. By contrast, models with a better grasp of the
context and instruction-following capabilities will potentially be more
susceptible to compromise by injected instructions. This underscores the need
to shift the focus from merely enhancing LLMs' instruction-following
capabilities to improving their overall comprehension of prompts and
discernment of instructions that are appropriate to follow. We hope our
in-depth analysis offers insights into the underlying causes of these
vulnerabilities, aiding in the development of future solutions. Code and data
are available at
https://github.com/Leezekun/instruction-following-robustness-eval
Related papers
- LLMs can be easily Confused by Instructional Distractions [16.060402139507644]
Large language models show exceptional skill in instruction following tasks.
This strength can turn into a vulnerability when the models are required to disregard certain instructions.
We introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction.
arXiv Detail & Related papers (2025-02-05T04:52:57Z) - Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models [8.020688053947547]
One of the key strengths of Large Language Models (LLMs) is their ability to interact with humans by generating appropriate responses to given instructions.
This ability, known as instruction-following capability, has established a foundation for the use of LLMs across various fields.
We have noted that LLMs can become easily distracted by instruction-formatted statements, which may lead to an oversight of their instruction comprehension skills.
arXiv Detail & Related papers (2024-12-27T04:37:39Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats.
One major cause of these vulnerabilities is the lack of an instruction hierarchy.
We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.
Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.
We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - Enhancing Large Language Models Against Inductive Instructions with
Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly.
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.
We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.