Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection
- URL: http://arxiv.org/abs/2308.10819v3
- Date: Sat, 25 Nov 2023 00:25:36 GMT
- Title: Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection
- Authors: Zekun Li and Baolin Peng and Pengcheng He and Xifeng Yan
- Abstract summary: Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
- Score: 70.28425745910711
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated exceptional proficiency in
instruction-following, becoming increasingly crucial across various
applications. However, this capability brings with it the risk of prompt
injection attacks, where attackers inject instructions into LLMs' input to
elicit undesirable actions or content. Understanding the robustness of LLMs
against such attacks is vital for their safe implementation. In this work, we
establish a benchmark to evaluate the robustness of instruction-following LLMs
against prompt injection attacks. Our objective is to determine the extent to
which LLMs can be influenced by injected instructions and their ability to
differentiate between these injected and original target instructions. Through
extensive experiments with leading instruction-following LLMs, we uncover
significant vulnerabilities in their robustness to such attacks. Our results
indicate that some models are overly tuned to follow any embedded instructions
in the prompt, overly focusing on the latter parts of the prompt without fully
grasping the entire context. By contrast, models with a better grasp of the
context and instruction-following capabilities will potentially be more
susceptible to compromise by injected instructions. This underscores the need
to shift the focus from merely enhancing LLMs' instruction-following
capabilities to improving their overall comprehension of prompts and
discernment of instructions that are appropriate to follow. We hope our
in-depth analysis offers insights into the underlying causes of these
vulnerabilities, aiding in the development of future solutions. Code and data
are available at
https://github.com/Leezekun/instruction-following-robustness-eval
Related papers
- Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats.
One major cause of these vulnerabilities is the lack of an instruction hierarchy.
We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z) - Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs [16.296171008281775]
Large Language Models (LLMs) have gained widespread use in various applications due to their powerful capability to generate human-like text.
prompt injection attacks involve overwriting a model's original instructions with malicious prompts to manipulate the generated text.
We propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to assess the robustness of LLMs against prompt injection attacks.
arXiv Detail & Related papers (2024-09-23T06:08:32Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Enhancing Large Language Models Against Inductive Instructions with
Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly.
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.
We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.