Defending against Indirect Prompt Injection by Instruction Detection
- URL: http://arxiv.org/abs/2505.06311v2
- Date: Wed, 17 Sep 2025 09:17:08 GMT
- Title: Defending against Indirect Prompt Injection by Instruction Detection
- Authors: Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, Fangzhao Wu,
- Abstract summary: InstructDetector is a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks.<n>InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark.
- Score: 109.30156975159561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerabilities of Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that IPI attacks fundamentally rely on the presence of instructions embedded within external content, which can alter the behavioral states of LLMs. Can the effective detection of such state changes help us defend against IPI attacks? In this paper, we propose InstructDetector, a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. Specifically, we demonstrate the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark. The code is publicly available at https://github.com/MYVAE/Instruction-detection.
Related papers
- Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace [0.0]
We show that adversarial instructions embedded in automatically generated URL previews can introduce a system-level risk that we refer to as silent egress.<n>Using a fully local and reproducible testbed, we demonstrate that a malicious web page can induce an agent to issue outbound requests that exfiltrate sensitive runtime context.<n>In 480 experimental runs with a qwen2.5:7b-based agent, the attack succeeds with high probability (P (egress) =0.89), and 95% of successful attacks are not detected by output-based safety checks.
arXiv Detail & Related papers (2026-02-25T22:26:23Z) - Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs [2.2448294058653455]
adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards.<n>We propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts.<n>ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining.
arXiv Detail & Related papers (2026-01-18T11:33:35Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs [27.395843922014294]
Large Language Models (LLMs) have been integrated into many applications (e.g., web agents) to perform more sophisticated tasks.<n>This paper presents Rennervate, a defense framework to detect and prevent IPI attacks.
arXiv Detail & Related papers (2025-12-09T09:44:13Z) - PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features [33.95073302161128]
Existing prompt injection detection methods often have sub-optimal performance and/or high computational overhead.<n>We propose PIShield, a detection method that is both effective and efficient.<n>We show that PIShield is both highly effective and efficient, substantially outperforming existing methods.
arXiv Detail & Related papers (2025-10-15T18:34:49Z) - TopicAttack: An Indirect Prompt Injection Attack via Topic Transition [71.81906608221038]
Large language models (LLMs) are vulnerable to indirect prompt injection attacks.<n>We propose TopicAttack, which prompts the LLM to generate a fabricated transition prompt that gradually shifts the topic toward the injected instruction.<n>We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
arXiv Detail & Related papers (2025-07-18T06:23:31Z) - CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks [47.62236306990252]
Large Language Models (LLMs) are susceptible to indirect prompt injection attacks.<n>This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt.<n>We propose CachePrune that defends against this attack by identifying and pruning task-triggering neurons.
arXiv Detail & Related papers (2025-04-29T23:42:21Z) - Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks.<n>We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z) - DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks [101.52204404377039]
LLM-integrated applications and agents are vulnerable to prompt injection attacks.<n>A detection method aims to determine whether a given input is contaminated by an injected prompt.<n>We propose DataSentinel, a game-theoretic method to detect prompt injection attacks.
arXiv Detail & Related papers (2025-04-15T16:26:21Z) - HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States [17.601328965546617]
We investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference.<n>Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts.<n>We introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety.
arXiv Detail & Related papers (2025-02-20T17:14:34Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.<n>We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.<n>We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift [104.76588209308666]
This paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains.<n>We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness.<n>We propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas.
arXiv Detail & Related papers (2024-06-27T02:31:03Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.<n>Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.<n>We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - ASSET: Robust Backdoor Data Detection Across a Multiplicity of Deep
Learning Paradigms [39.753721029332326]
Backdoor data detection is traditionally studied in an end-to-end supervised learning (SL) setting.
Recent years have seen the proliferating adoption of self-supervised learning (SSL) and transfer learning (TL) due to their lesser need for labeled data.
We show that the performance of most existing detection methods varies significantly across different attacks and poison ratios, and all fail on the state-of-the-art clean-label attack.
arXiv Detail & Related papers (2023-02-22T14:43:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.