Fugu-MT Paper Translation (Abstract): CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

Paper overview: CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

arxiv url: http://arxiv.org/abs/2606.05523v1
Date: Thu, 04 Jun 2026 00:06:13 GMT
Status: Translation complete
Last updated in system: 2026-06-06 06:55:34.621492
Title: CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
Title (reference translation): CHASE: Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
Authors: Rahul Markasserithodi, Aditya Joshi, Yuekang Li, Ishmanbir Singh, Chris Yoo, Alan Niu
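Abstract summary: We introduce CHASE, a red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. CHASE cuts the mean StrongREJECT score by 43.2% with 0% false refusals on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families.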
Reference score (independently computed attention score): 11.739543857396775
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation can bypass safety filters even on frontier models. Existing defenses rely either on non-scalable human curation or on white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts mean StrongREJECT score by 43.2% with 0% false-refusal on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.
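Abstract (reference translation): Despite progress in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation can circumvent safety filters even on frontier models. Existing defenses depend on human curation that does not scale, or on white-box optimisation that overfits to the internals of a specific model, so aligned models remain fragile against the adaptive black-box adversaries they will meet in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained with Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts the mean StrongREJECT score by 43.2% with 0% false refusals on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved so far in adversarial training.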
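
The abstract mentions a multiplicative attacker reward and GRPO but does not give their exact form. The following is only a minimal sketch under stated assumptions: the bypass and fidelity scores are assumed to lie in [0, 1], the reward is assumed to be their plain product, and the advantage is the usual GRPO group normalisation (reward minus group mean, divided by group standard deviation). The function names and the example scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def multiplicative_reward(bypass: float, fidelity: float) -> float:
    """Assumed reward: product of two scores in [0, 1].

    A product (rather than a sum) is zero whenever either score is zero,
    so a rewrite that bypasses the defender but drops the original intent
    earns nothing, and vice versa.
    """
    if not (0.0 <= bypass <= 1.0 and 0.0 <= fidelity <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return bypass * fidelity


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalise each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical (bypass, fidelity) scores for four rewrites of one prompt.
    group = [(0.9, 0.8), (1.0, 0.0), (0.2, 1.0), (0.6, 0.6)]
    rewards = [multiplicative_reward(b, f) for b, f in group]
    for (b, f), r, a in zip(group, rewards, group_relative_advantages(rewards)):
        print(f"bypass={b:.1f} fidelity={f:.1f} reward={r:.2f} advantage={a:+.2f}")
```

In this sketch the second rewrite bypasses perfectly but has zero fidelity, so its reward is zero and its advantage falls below the group mean; an additive reward would not penalise it as strongly.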

Related papers