Related papers: Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

URL: http://arxiv.org/abs/2308.10819v3
Date: Sat, 25 Nov 2023 00:25:36 GMT
Title: Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection
Authors: Zekun Li and Baolin Peng and Pengcheng He and Xifeng Yan
Abstract summary: Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following. This capability brings with it the risk of prompt injection attacks. We evaluate the robustness of instruction-following LLMs against such attacks.
Score: 70.28425745910711
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following, becoming increasingly crucial across various applications. However, this capability brings with it the risk of prompt injection attacks, where attackers inject instructions into LLMs' input to elicit undesirable actions or content. Understanding the robustness of LLMs against such attacks is vital for their safe implementation. In this work, we establish a benchmark to evaluate the robustness of instruction-following LLMs against prompt injection attacks. Our objective is to determine the extent to which LLMs can be influenced by injected instructions and their ability to differentiate between these injected and original target instructions. Through extensive experiments with leading instruction-following LLMs, we uncover significant vulnerabilities in their robustness to such attacks. Our results indicate that some models are overly tuned to follow any embedded instructions in the prompt, overly focusing on the latter parts of the prompt without fully grasping the entire context. By contrast, models with a better grasp of the context and instruction-following capabilities will potentially be more susceptible to compromise by injected instructions. This underscores the need to shift the focus from merely enhancing LLMs' instruction-following capabilities to improving their overall comprehension of prompts and discernment of instructions that are appropriate to follow. We hope our in-depth analysis offers insights into the underlying causes of these vulnerabilities, aiding in the development of future solutions. Code and data are available at https://github.com/Leezekun/instruction-following-robustness-eval

Related papers

TopicAttack: An Indirect Prompt Injection Attack via Topic Transition [71.81906608221038]
Large language models (LLMs) are vulnerable to indirect prompt injection attacks.<n>We propose TopicAttack, which prompts the LLM to generate a fabricated transition prompt that gradually shifts the topic toward the injected instruction.<n>We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
arXiv Detail & Related papers (2025-07-18T06:23:31Z)
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks. We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z)
LLMs can be easily Confused by Instructional Distractions [16.060402139507644]
Large language models show exceptional skill in instruction following tasks. This strength can turn into a vulnerability when the models are required to disregard certain instructions. We introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction.
arXiv Detail & Related papers (2025-02-05T04:52:57Z)
Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models [8.020688053947547]
One of the key strengths of Large Language Models (LLMs) is their ability to interact with humans by generating appropriate responses to given instructions. This ability, known as instruction-following capability, has established a foundation for the use of LLMs across various fields. We have noted that LLMs can become easily distracted by instruction-formatted statements, which may lead to an oversight of their instruction comprehension skills.
arXiv Detail & Related papers (2024-12-27T04:37:39Z)
Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks. We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction. We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z)
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats. One major cause of these vulnerabilities is the lack of an instruction hierarchy. We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z)
Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks. Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks. Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs [16.296171008281775]
Large Language Models (LLMs) have gained widespread use in various applications due to their powerful capability to generate human-like text. prompt injection attacks involve overwriting a model's original instructions with malicious prompts to manipulate the generated text. We propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to assess the robustness of LLMs against prompt injection attacks.
arXiv Detail & Related papers (2024-09-23T06:08:32Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)
Hijacking Large Language Models via Adversarial In-Context Learning [10.416972293173993]
In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks by utilizing labeled examples as demonstrations (demos) in the preconditioned prompts.<n>Existing attacks are either easy to detect, require a trigger in user input, or lack specificity towards ICL.<n>This work introduces a novel transferable prompt injection attack against ICL, aiming to hijack LLMs to generate the target output or elicit harmful responses.
arXiv Detail & Related papers (2023-11-16T15:01:48Z)
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions. We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.