Fugu-MT paper translation (summary): The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

Paper summary: The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

arxiv url: http://arxiv.org/abs/2605.08427v1
Date: Fri, 08 May 2026 19:41:59 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.638832
Title: The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Title (reference translation): The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Authors: Gabriele La Malfa, Emanuele La Malfa, Saar Cohen, Jie M. Zhang, Michael Luck, Michael Wooldridge, Elizabeth Black
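Abstract summary: Self-play red teaming is an established approach to improving AI safety. We propose Anchored Bipolicy Self-Play, which trains role-specific LoRA adapters on top of a frozen base model. It achieves up to 100x greater parameter efficiency than fine-tuning and consistent safety improvements over self-play fine-tuned models.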
Reference score (independently computed attention score): 16.696570190611112
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by the use of the same model for the two roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of Nash equilibria that can be reached corresponds to a broad class of behaviours that includes trivial always-refuse strategies and oracle-like defenders, thus limiting practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not enforce adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. In relation to standard self-play, we show up to 100x greater parameter efficiency than fine-tuning and consistent improvements in safety compared to self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B, 14B}-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to self-play in terms of adversarial defence and safety.
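Abstract (reference translation): Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play the attacker and defender roles in a zero-sum game, i.e., the attacker tries to jailbreak the defender. The parameter sharing imposed by using the same model for both roles improves stability and performance, but it introduces fundamental theoretical and architectural limitations. We show that the set of reachable Nash equilibria corresponds to a broad class of behaviours that includes trivial strategies and oracle-like defenders, which limits practical applicability. We then show that when the attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not put adversarial pressure on the defender. We therefore propose Anchored Bipolicy Self-Play, which trains role-specific LoRA adapters on top of a frozen base model and maintains stable optimisation while preserving adversarial pressure. Relative to standard self-play, it achieves up to 100x greater parameter efficiency than fine-tuning and consistent safety improvements over self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B, 14B}-IT models across widely used safety benchmarks. Cross-play experiments show that our attacker and defender models are superior to self-play in terms of adversarial defence and safety.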
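
The role separation described in the abstract can be illustrated with a toy layer: one frozen base weight matrix shared by both roles, plus a separate low-rank (LoRA-style) update per role. This is only a minimal sketch under stated assumptions, not the authors' implementation. The class name `BipolicyLoRALinear`, the dimensions, the rank, and the initialisation scale are all hypothetical, and the abstract does not specify the training loop, the reward, or where the adapters are attached.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class BipolicyLoRALinear:
    """Toy linear layer: frozen base W plus one low-rank update B @ A per role.

    Hypothetical illustration only; names and sizes are assumptions.
    """

    def __init__(self, d_in, d_out, rank, roles=("attacker", "defender"), seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # Shared base weights: never modified after construction (the "anchor").
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        self.adapters = {}
        for role in roles:
            # A is randomly initialised, B starts at zero, so each adapter
            # is initially a no-op on top of the base model.
            A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
            B = [[0.0] * rank for _ in range(d_out)]
            self.adapters[role] = {"A": A, "B": B}

    def forward(self, x, role):
        """Compute W x + B_role (A_role x) for the requested role."""
        base = matvec(self.W, x)
        adapter = self.adapters[role]
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        return [b + d for b, d in zip(base, delta)]

    def adapter_params(self):
        """Trainable parameters per role: rank * (d_in + d_out)."""
        return self.rank * (self.d_in + self.d_out)

    def full_params(self):
        """Parameters touched by full fine-tuning of this layer: d_in * d_out."""
        return self.d_in * self.d_out


layer = BipolicyLoRALinear(d_in=8, d_out=8, rank=2)
x = [1.0] * 8
defender_before = layer.forward(x, "defender")

# Stand-in for a gradient step on the attacker: edit only the attacker's B.
layer.adapters["attacker"]["B"][0][0] = 1.0

defender_after = layer.forward(x, "defender")
attacker_after = layer.forward(x, "attacker")
```

In this sketch, changing the attacker's adapter leaves both the base weights and the defender's outputs untouched, which is the separation the abstract contrasts with a single shared, jointly updated model. The per-role parameter count grows as `rank * (d_in + d_out)` rather than `d_in * d_out`; the toy sizes here are not meant to reproduce the efficiency figure reported in the abstract.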

Related papers