DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture
- URL: http://arxiv.org/abs/2511.00447v1
- Date: Sat, 01 Nov 2025 08:26:37 GMT
- Title: DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture
- Authors: Ruofan Liu, Yun Lin, Jin Song Dong
- Abstract summary: Large language models (LLMs) have demonstrated impressive instruction-following capabilities. A core vulnerability lies in the model's lack of semantic role understanding. We propose DRIP, a training-time defense grounded in a semantic modeling perspective.
- Score: 21.45291667976768
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive instruction-following capabilities. However, these capabilities also expose models to prompt injection attacks, where maliciously crafted inputs overwrite or distract from the intended instructions. A core vulnerability lies in the model's lack of semantic role understanding: it cannot distinguish directive intent from descriptive content, leading it to execute instruction-like phrases embedded in data. We propose DRIP, a training-time defense grounded in a semantic modeling perspective, which enforces robust separation between instruction and data semantics without sacrificing utility. DRIP introduces two lightweight yet complementary mechanisms: (1) a token-wise de-instruction shift that performs semantic disentanglement, weakening directive semantics in data tokens while preserving content meaning; and (2) a residual fusion pathway that provides a persistent semantic anchor, reinforcing the influence of the true top-level instruction during generation. Experimental results on LLaMA-8B and Mistral-7B across three prompt injection benchmarks (SEP, AlpacaFarm, and InjecAgent) demonstrate that DRIP outperforms state-of-the-art defenses, including StruQ, SecAlign, ISE, and PFT, improving role separation by 49%, and reducing attack success rate by 66% for adaptive attacks. Meanwhile, DRIP's utility is on par with the undefended model across AlpacaEval, IFEval, and MT-Bench. Our findings underscore the power of lightweight representation edits and role-aware supervision in securing LLMs against adaptive prompt injection.
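The two mechanisms described in the abstract can be pictured with a toy numpy sketch. The paper's actual parameterization is not given here, so everything below (the learned "directive" direction, the mean-pooled instruction anchor, and the fusion weight `alpha`) is an illustrative assumption, not DRIP's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # toy hidden size

# Toy embeddings: 3 top-level instruction tokens, 5 data tokens.
instr = rng.normal(size=(3, d))
data = rng.normal(size=(5, d))

# (1) Token-wise de-instruction shift (assumed form): project out
# each data token's component along a "directive" direction, so
# instruction-like semantics are weakened while the remaining
# content component of the vector is left untouched.
directive = rng.normal(size=d)
directive /= np.linalg.norm(directive)
shifted = data - (data @ directive)[:, None] * directive

# Data tokens now carry no directive component.
assert np.allclose(shifted @ directive, 0.0, atol=1e-8)

# (2) Residual fusion pathway (assumed form): inject a pooled
# summary of the true top-level instruction into later hidden
# states as a persistent semantic anchor during generation.
anchor = instr.mean(axis=0)
alpha = 0.1                              # fusion strength (assumed)
hidden = rng.normal(size=(5, d))         # some later-layer states
fused = hidden + alpha * anchor
```

In this reading, the shift operates on the data channel (removing directive intent) while the fusion operates on the instruction channel (keeping it influential), which is why the two mechanisms are complementary rather than redundant.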
Related papers
- Defense Against Indirect Prompt Injection via Tool Result Parsing [5.69701430275527]
LLM agents face an escalating threat from indirect prompt injection. This vulnerability poses a significant risk as agents gain more direct control over physical environments. We propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code.
arXiv Detail & Related papers (2026-01-08T10:21:56Z) - Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning [31.790490397086856]
Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. We propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning.
arXiv Detail & Related papers (2026-01-08T07:25:27Z) - Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis [48.70474961584997]
Indirect prompt injection attacks (IPIAs) pose a critical threat to large language models (LLMs). We present IntentGuard, a general defense framework based on instruction-following intent analysis.
arXiv Detail & Related papers (2025-11-30T16:29:04Z) - Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification [5.963719408944521]
This paper introduces Label Disguise Defense (LDD), a lightweight strategy that conceals true labels by replacing them with semantically transformed alias labels. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants.
arXiv Detail & Related papers (2025-11-23T20:16:51Z) - TopicAttack: An Indirect Prompt Injection Attack via Topic Transition [92.26240528996443]
Large language models (LLMs) are vulnerable to indirect prompt injection attacks. We propose TopicAttack, which prompts the LLM to generate a fabricated transition prompt that gradually shifts the topic toward the injected instruction. We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
arXiv Detail & Related papers (2025-07-18T06:23:31Z) - Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework finetunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces the attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z) - CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks [47.62236306990252]
Large Language Models (LLMs) are susceptible to indirect prompt injection attacks. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. We propose CachePrune, which defends against this attack by identifying and pruning task-triggering neurons.
arXiv Detail & Related papers (2025-04-29T23:42:21Z) - Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks. We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z) - ASIDE: Architectural Separation of Instructions and Data in Language Models [87.16417239344285]
ASIDE allows language models to clearly separate instructions and data at the level of embeddings. We demonstrate experimentally that, across a range of models, instruction-tuning LLMs with ASIDE leads to greatly increased instruction-data separation without a loss in model utility. We provide insights into the mechanism underlying our method through an analysis of the model representations.
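As a rough illustration of embedding-level separation, a fixed orthogonal rotation can move data-token embeddings into a distinct region of the embedding space while exactly preserving their geometry. The specific 90-degree pairwise rotation below is an assumed toy form for illustration, not necessarily ASIDE's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                                   # toy embedding size

emb = rng.normal(size=(4, d))            # data-token embeddings

# Orthogonal 90-degree rotation acting on pairs of adjacent
# dimensions: each 2x2 block is [[0, -1], [1, 0]].
rot = np.zeros((d, d))
for i in range(0, d, 2):
    rot[i, i + 1] = -1.0
    rot[i + 1, i] = 1.0

rotated = emb @ rot

# Rotation is orthogonal, so norms (and pairwise distances) of
# the data embeddings are preserved while their direction in
# embedding space is made distinguishable from instructions.
assert np.allclose(np.linalg.norm(rotated, axis=1),
                   np.linalg.norm(emb, axis=1))
```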
arXiv Detail & Related papers (2025-03-13T17:17:17Z) - Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.