Fugu-MT Paper Translation (Abstract): Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Paper abstract: Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

arxiv url: http://arxiv.org/abs/2510.03705v1
Date: Sat, 04 Oct 2025 07:11:11 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.205624
Title: Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
Title (reference translation): Backdoor-powered prompt injection attacks nullify defense methods
Authors: Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi
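Abstract summary: Large language models (LLMs) are vulnerable to prompt injection attacks. This paper explores more vicious attacks that nullify prompt injection defense methods. Backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks.
Reference score (independently computed attention score): 95.54363609024847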
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the development of technology, large language models (LLMs) have dominated the downstream natural language processing (NLP) tasks. However, because of the LLMs' instruction-following abilities and inability to distinguish the instructions in the data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction and executing the attackers' target instruction. Recently, various instruction hierarchy defense strategies are proposed to effectively defend against prompt injection attacks via fine-tuning. In this paper, we explore more vicious attacks that nullify the prompt injection defense methods, even the instruction hierarchy: backdoor-powered prompt injection attacks, where the attackers utilize the backdoor attack for prompt injection attack purposes. Specifically, the attackers poison the supervised fine-tuning samples and insert the backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. We construct a benchmark for comprehensive evaluation. Our experiments demonstrate that backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.
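Abstract (reference translation): With the development of technology, large language models (LLMs) have come to dominate downstream natural language processing (NLP) tasks. However, because of their instruction-following ability and their inability to distinguish instructions embedded in data content, such as web pages returned by search engines, LLMs are vulnerable to prompt injection attacks. These attacks trick an LLM into deviating from the original input instruction and executing the attacker's target instruction. Recently, various instruction hierarchy defense strategies have been proposed to defend effectively against prompt injection attacks through fine-tuning. This paper examines more vicious attacks that nullify prompt injection defense methods, including the instruction hierarchy: backdoor-powered prompt injection attacks, in which attackers use a backdoor attack for prompt injection purposes. Specifically, the attackers poison supervised fine-tuning samples and insert a backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. A benchmark is constructed for comprehensive evaluation. The experiments show that backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks and nullify existing prompt injection defense methods, including instruction hierarchy techniques.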


Related papers