Fugu-MT Paper Translation (Abstract): The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

Paper overview: The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

arxiv url: http://arxiv.org/abs/2606.09204v1
Date: Mon, 08 Jun 2026 08:38:33 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.840929
Title: The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection
Authors: Hyunseok Paeng
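Abstract summary: In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate. This suppression propagates beyond the injected document to unmodified documents of the same brand. These findings raise the technical possibility of a reverse-attack scenario in which an adversary embeds injections in a competitor's documents.
Reference score (independently computed attention score): 0.0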
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a reproducible failure mode of safety training in RAG-based LLM recommendation -- the Injection Paradox -- in which prompt injections embedded in retrieved documents backfire against the attacker, suppressing the target brand below the injection-free baseline. In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate, and this suppression propagates beyond the injected document to unmodified documents of the same brand. In Claude Opus 4.6, the target brand drops from a 54% baseline to zero top-2 recommendations across all 50 trials, even though only 1 of 4 brand documents in the corpus contains an injection. The directional pattern is reproduced in counterfactual experiments and across three brands. A contrasting result across the GPT models tested, where the same injection instead increases recommendations, suggests model-family differences in how injection-like context affects recommendation behavior. These findings raise the technical possibility of a reverse-attack scenario in which an adversary embeds injections in a competitor's documents to suppress the competitor's brand via safety-sensitive model behavior.
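The headline numbers in the abstract can be read as a top-2 recommendation rate: the share of trials in which the target brand appears among the first two recommendations. The sketch below is only an illustration of that arithmetic, not the author's evaluation code. The function name `top_k_rate`, the brand labels, and the rankings are all hypothetical, and the baseline is assumed to use the same 50 trials (so that 54% corresponds to 27 trials), which the abstract does not state.

```python
from typing import Sequence


def top_k_rate(trials: Sequence[Sequence[str]], brand: str, k: int = 2) -> float:
    """Return the fraction of trials whose first k recommendations include `brand`."""
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(1 for ranking in trials if brand in ranking[:k])
    return hits / len(trials)


# Hypothetical rankings, constructed only to match the rates quoted in the abstract.
# Assumption: 50 baseline trials, so a 54% baseline means 27 trials with a top-2 hit.
baseline = [["BrandA", "BrandB", "BrandC"]] * 27 + [["BrandB", "BrandC", "BrandA"]] * 23
# With the injection present, the target brand never reaches the top 2.
injected = [["BrandB", "BrandC", "BrandA"]] * 50

baseline_rate = top_k_rate(baseline, "BrandA")
injected_rate = top_k_rate(injected, "BrandA")
print(f"baseline={baseline_rate:.2f} injected={injected_rate:.2f}")
```

Under these assumptions the injection lowers the target brand's top-2 rate by 54 percentage points, which is the backfire effect the abstract calls the Injection Paradox.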
