Fugu-MT Paper Translation (Abstract): DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture

Paper overview: DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture

arxiv url: http://arxiv.org/abs/2511.00447v1
Date: Sat, 01 Nov 2025 08:26:37 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:26.781513
Title: DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture
Authors: Ruofan Liu, Yun Lin, Jin Song Dong
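Abstract summary: Large language models (LLMs) have demonstrated impressive instruction-following capabilities. A core vulnerability lies in the model's lack of semantic role understanding. This paper proposes DRIP, a training-time defense grounded in a semantic modeling perspective.
Reference score (independently computed attention measure): 21.45291667976768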
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) have demonstrated impressive instruction-following capabilities. However, these capabilities also expose models to prompt injection attacks, where maliciously crafted inputs overwrite or distract from the intended instructions. A core vulnerability lies in the model's lack of semantic role understanding: it cannot distinguish directive intent from descriptive content, leading it to execute instruction-like phrases embedded in data. We propose DRIP, a training-time defense grounded in a semantic modeling perspective, which enforces robust separation between instruction and data semantics without sacrificing utility. DRIP introduces two lightweight yet complementary mechanisms: (1) a token-wise de-instruction shift that performs semantic disentanglement, weakening directive semantics in data tokens while preserving content meaning; and (2) a residual fusion pathway that provides a persistent semantic anchor, reinforcing the influence of the true top-level instruction during generation. Experimental results on LLaMA-8B and Mistral-7B across three prompt injection benchmarks (SEP, AlpacaFarm, and InjecAgent) demonstrate that DRIP outperforms state-of-the-art defenses, including StruQ, SecAlign, ISE, and PFT, improving role separation by 49%, and reducing attack success rate by 66% for adaptive attacks. Meanwhile, DRIP's utility is on par with the undefended model across AlpacaEval, IFEval, and MT-Bench. Our findings underscore the power of lightweight representation edits and role-aware supervision in securing LLMs against adaptive prompt injection.
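The two mechanisms named in the abstract can be pictured with a toy NumPy sketch. This is an illustration only, not the authors' implementation: the abstract gives no formulas, so the function names `de_instruction_shift` and `residual_fusion`, the fixed `shift` vector, the mean pooling, and the `alpha` weight are all assumptions made here for the sake of the example. In a real model the shift would presumably be learned and applied inside the network.

```python
import numpy as np

def de_instruction_shift(embeddings, is_data, shift):
    """Hypothetical token-wise shift: move only data-token embeddings.

    embeddings: (seq_len, dim) array, is_data: (seq_len,) bool mask,
    shift: (dim,) vector (assumed fixed here; not taken from the paper).
    """
    # The boolean mask zeroes the shift on instruction tokens.
    return embeddings + is_data[:, None] * shift[None, :]

def residual_fusion(hidden, embeddings, is_instruction, alpha=0.5):
    """Hypothetical residual pathway: add a pooled instruction vector
    to the hidden state at every position as a persistent anchor."""
    anchor = embeddings[is_instruction].mean(axis=0)  # (dim,)
    return hidden + alpha * anchor[None, :]

# Toy input: 2 instruction tokens followed by 4 data tokens, dim 4.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))
is_data = np.array([False, False, True, True, True, True])
shift = rng.normal(size=4)

shifted = de_instruction_shift(emb, is_data, shift)
fused = residual_fusion(shifted, shifted, ~is_data)
```

In this sketch the instruction tokens pass through the shift unchanged, the data tokens are all displaced by the same vector, and every position then receives the same instruction-derived residual.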


Related papers