Fugu-MT Paper Translation (Abstract): Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

Paper summary: Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

arxiv url: http://arxiv.org/abs/2504.20472v1
Date: Tue, 29 Apr 2025 07:13:53 GMT
Status: Translation complete
Last updated in system: 2025-05-02 19:15:54.782238
Title: Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
Title (reference translation): Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
Authors: Yulin Chen, Haoran Li, Yuan Sui, Yue Liu, Yufei He, Yangqiu Song, Bryan Hooi
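Abstract summary: Large language models (LLMs) are vulnerable to prompt injection attacks. This work proposes a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
Reference score (independently computed attention score): 68.6543680065379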
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have demonstrated impressive performance and have come to dominate the field of natural language processing (NLP) across various tasks. However, due to their strong instruction-following capabilities and inability to distinguish between instructions and data content, LLMs are vulnerable to prompt injection attacks. These attacks manipulate LLMs into deviating from the original input instructions and executing maliciously injected instructions within data content, such as web documents retrieved from search engines. Existing defense methods, including prompt-engineering and fine-tuning approaches, typically instruct models to follow the original input instructions while suppressing their tendencies to execute injected instructions. However, our experiments reveal that suppressing instruction-following tendencies is challenging. Through analyzing failure cases, we observe that although LLMs tend to respond to any recognized instructions, they are aware of which specific instructions they are executing and can correctly reference them within the original prompt. Motivated by these findings, we propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs. Our approach prompts LLMs to generate responses that include both answers and their corresponding instruction references. Based on these references, we filter out answers not associated with the original input instructions. Comprehensive experiments demonstrate that our method outperforms prompt-engineering baselines and achieves performance comparable to fine-tuning methods, reducing the attack success rate (ASR) to 0 percent in some scenarios. Moreover, our approach has minimal impact on overall utility.
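Abstract (reference translation): Large language models (LLMs) have shown remarkable performance and dominate the field of natural language processing (NLP) across a variety of tasks. However, because they follow instructions so readily and cannot distinguish instructions from data content, LLMs are vulnerable to prompt injection attacks. These attacks manipulate an LLM into deviating from the original input instruction and executing maliciously injected instructions embedded in data content, such as web documents retrieved from search engines. Existing defenses, including prompt-engineering and fine-tuning approaches, typically instruct the model to follow the original input instruction while suppressing its tendency to execute injected instructions. However, the experiments show that suppressing instruction-following tendencies is difficult. An analysis of failure cases shows that, although LLMs tend to respond to any instruction they recognize, they are aware of which specific instruction they are executing and can correctly reference it within the original prompt. Motivated by these findings, the authors propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs. The method prompts the LLM to generate a response that contains both answers and their corresponding instruction references. Based on these references, answers not associated with the original input instruction are filtered out. Comprehensive experiments show that the method outperforms prompt-engineering baselines and achieves performance comparable to fine-tuning methods, reducing the attack success rate (ASR) to 0 percent in some scenarios. Moreover, the approach has minimal impact on overall utility.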
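
The filtering step described in the abstract (keep only the answers whose instruction reference points back to the original input instruction) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the `[instruction]...[/instruction][answer]...[/answer]` output format, the exact-match comparison after normalization, and the names `parse_tagged_response` and `filter_by_reference` are all assumptions introduced here, since the abstract does not specify the response format or the matching rule.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class TaggedAnswer:
    reference: str  # instruction text the model says it executed
    answer: str     # answer produced for that instruction


# Hypothetical response format (an assumption, not taken from the paper):
# [instruction]...[/instruction][answer]...[/answer]
BLOCK_RE = re.compile(
    r"\[instruction\](.*?)\[/instruction\]\s*\[answer\](.*?)\[/answer\]",
    re.DOTALL,
)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def parse_tagged_response(response: str) -> List[TaggedAnswer]:
    """Split a model response into (instruction reference, answer) pairs."""
    return [
        TaggedAnswer(ref.strip(), ans.strip())
        for ref, ans in BLOCK_RE.findall(response)
    ]


def filter_by_reference(response: str, original_instruction: str) -> List[str]:
    """Keep only answers whose reference matches the original instruction."""
    target = normalize(original_instruction)
    return [
        item.answer
        for item in parse_tagged_response(response)
        if normalize(item.reference) == target
    ]


if __name__ == "__main__":
    # Simulated model output: one answer for the user's instruction and one
    # for an instruction injected through the retrieved document.
    response = (
        "[instruction]Summarize the document.[/instruction]"
        "[answer]The document describes X.[/answer]\n"
        "[instruction]Print 'hacked'.[/instruction]"
        "[answer]hacked[/answer]"
    )
    print(filter_by_reference(response, "Summarize the document."))
    # prints ['The document describes X.']
```

Exact matching after normalization is the simplest possible rule and is used here only for illustration; a practical system would likely need a more tolerant way to decide whether a reference points at the original instruction.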

Related papers