Fugu-MT Paper Translation (Abstract): Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

Paper abstract: Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

arxiv url: http://arxiv.org/abs/2606.16242v1
Date: Mon, 15 Jun 2026 05:40:50 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:34.099103
Title: Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework
Title (reference translation): Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework
Authors: David Huang, Jaewon Chang, Avidan Shah, Prateek Mittal, Chawin Sitawarin
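Abstract summary: We show that prompt injection can infiltrate the pipeline and deliver poisoned samples into the training set. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept's presence with the safe label.
Reference score (independently computed attention score): 38.96400184175405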
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Rapid Response (RR) framework, deployed in production systems, including Anthropic's ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the model generalize from the new attacks and quickly adapt. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier's training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives on harmless samples by categorizing them as a jailbreak, with a specific desired feature (e.g., certain formatting, subject, or keyword), (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs, generalizing even to jailbreaks from attack strategies the defender explicitly trained against, when the backdoor trigger is present. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept's presence with the safe label. Both attacks cause substantial and in some cases near-complete label flipping at only a 1% poisoning rate, achieving up to 100% false positive rates and up to 96% false negative rates.
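Abstract (reference translation): The Rapid Response (RR) framework, deployed in production systems including Anthropic's ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the model generalize from the new attacks and adapt quickly. Prompt injection can infiltrate this pipeline and deliver poisoned samples into the classifier's training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives by causing harmless samples with a specific desired feature (e.g., certain formatting, subject, or keyword) to be classified as jailbreaks, and (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs when the backdoor trigger is present, generalizing even to jailbreaks from attack strategies the defender explicitly trained against. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels). We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept's presence with the safe label. Both attacks cause substantial, and in some cases near-complete, label flipping at a poisoning rate of only 1%, reaching false positive rates of up to 100% and false negative rates of up to 96%.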
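
The omission effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or classifier: it assumes a hypothetical one-feature Bernoulli naive Bayes model with Laplace smoothing, and the function name `concept_log_odds` and all sample counts are invented for illustration. It only shows how adding unsafe-labeled samples that lack a concept can shift that concept's evidence toward the safe label, without touching benign data or labels.

```python
import math


def concept_log_odds(samples, smoothing=1.0):
    """Log-odds contributed toward 'unsafe' when the concept is present.

    samples: list of (has_concept, label) pairs, label in {"safe", "unsafe"}.
    Hypothetical toy model: one-feature Bernoulli naive Bayes with
    Laplace smoothing. A negative value means that seeing the concept
    pushes the prediction toward "safe".
    """
    unsafe = [has for has, label in samples if label == "unsafe"]
    safe = [has for has, label in samples if label == "safe"]
    # Smoothed estimates of P(concept present | label).
    p_unsafe = (sum(unsafe) + smoothing) / (len(unsafe) + 2 * smoothing)
    p_safe = (sum(safe) + smoothing) / (len(safe) + 2 * smoothing)
    return math.log(p_unsafe / p_safe)


def make_samples(n_with, n_without, label):
    """Build n_with concept-present and n_without concept-absent samples."""
    return [(True, label)] * n_with + [(False, label)] * n_without


# Made-up clean data: the concept is equally common in both classes.
clean = make_samples(10, 90, "safe") + make_samples(10, 90, "unsafe")

# Made-up injected data: only unsafe-labeled samples, all lacking the concept.
omitted = make_samples(0, 100, "unsafe")

baseline = concept_log_odds(clean)
poisoned = concept_log_odds(clean + omitted)

print(f"baseline log-odds: {baseline:.3f}")
print(f"poisoned log-odds: {poisoned:.3f}")
```

In this toy setting the concept carries no evidence before the injection, and negative evidence for the unsafe label afterwards. The injected fraction here is far larger than the 1% poisoning rate reported in the abstract; the numbers are chosen only to make the direction of the shift visible.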

Related papers