Fugu-MT Paper Translation (Abstract): AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

Paper overview: AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

arxiv url: http://arxiv.org/abs/2602.13597v1
Date: Sat, 14 Feb 2026 04:35:24 GMT
Status: Translation complete
Last updated in system: 2026-02-17 14:17:28.2288
Title: AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks
Title (reference translation): AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks
Authors: Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, Neil Gong
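Abstract summary: Prompt injection attacks insert malicious instructions into an LLM's input to steer it toward an attacker-chosen task instead of the intended one. Existing detection defenses typically classify any input containing an instruction as malicious. This work accounts for the instruction hierarchy and distinguishes among three categories: inputs with misaligned instructions, inputs with aligned instructions, and non-instruction inputs.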
Reference score (independently computed attention level): 20.9342308883234
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prompt injection attacks insert malicious instructions into an LLM's input to steer it toward an attacker-chosen task instead of the intended one. Existing detection defenses typically classify any input containing an instruction as malicious, which leads them to misclassify benign inputs whose instructions align with the intended task. In this work, we account for the instruction hierarchy and distinguish among three categories: inputs with misaligned instructions, inputs with aligned instructions, and non-instruction inputs. We introduce AlignSentinel, a three-class classifier that leverages features derived from an LLM's attention maps to categorize inputs accordingly. To support evaluation, we construct the first systematic benchmark containing inputs from all three categories. Experiments on both our benchmark and existing ones (where inputs with aligned instructions are largely absent) show that AlignSentinel accurately detects inputs with misaligned instructions and substantially outperforms baselines.
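Abstract (reference translation): Prompt injection attacks insert malicious instructions into an LLM's input and steer it toward an attacker-chosen task instead of the intended one. Existing detection defenses classify any input containing an instruction as malicious, and so misclassify benign inputs whose instructions align with the intended task. This work accounts for the instruction hierarchy and distinguishes among three categories: inputs with misaligned instructions, inputs with aligned instructions, and non-instruction inputs. It introduces AlignSentinel, a three-class classifier that categorizes inputs using features derived from an LLM's attention maps. To support evaluation, the authors construct the first systematic benchmark containing inputs from all three categories. They show that AlignSentinel accurately detects inputs with misaligned instructions and substantially outperforms baselines.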
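
The three-category distinction in the abstract can be illustrated with a minimal Python sketch. It is not the authors' classifier and uses no attention-map features. It assumes the true category of each input is already known, and only compares two flagging policies: one that flags any instruction, and one that flags only misaligned instructions. The names `InputCategory`, `binary_detector_flags`, `alignment_aware_flags`, and `false_positives` are hypothetical and chosen for illustration.

```python
from enum import Enum


class InputCategory(Enum):
    # The three categories named in the abstract.
    MISALIGNED_INSTRUCTION = "misaligned instruction"
    ALIGNED_INSTRUCTION = "aligned instruction"
    NON_INSTRUCTION = "non-instruction"


def binary_detector_flags(category):
    """Hypothetical instruction-vs-no-instruction policy: flag any instruction."""
    return category is not InputCategory.NON_INSTRUCTION


def alignment_aware_flags(category):
    """Hypothetical three-class policy: flag only misaligned instructions."""
    return category is InputCategory.MISALIGNED_INSTRUCTION


def false_positives(flags):
    """Return the benign categories (everything except misaligned) that a policy flags."""
    return [
        category
        for category in InputCategory
        if category is not InputCategory.MISALIGNED_INSTRUCTION and flags(category)
    ]


if __name__ == "__main__":
    for name, policy in [
        ("binary", binary_detector_flags),
        ("alignment-aware", alignment_aware_flags),
    ]:
        flagged = [c.value for c in false_positives(policy)]
        print(f"{name}: benign categories flagged = {flagged}")
```

Under these assumptions, the binary policy flags benign inputs with aligned instructions, which is the misclassification the abstract describes, while the three-class policy does not.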

Related papers