Fugu-MT Paper Translation (Abstract): Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

Paper overview: Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

arxiv url: http://arxiv.org/abs/2605.01899v1
Date: Sun, 03 May 2026 14:28:08 GMT
Status: Translation complete
System update date: 2026-05-05 20:33:49.985099
Title: Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
Authors: Jiajia Li, Xiaoyu Wen, Zhongtian Ma, Shuyue Hu, Qiaosheng Zhang, Zhen Wang
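Abstract summary: Persona-Invariant Alignment (PIA) is an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. PICL is grounded in the structural separation hypothesis and uses a unilateral KL-divergence constraint to decouple safety decisions from persona context. Experimental results show that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation.
Reference score (attention score computed by this site): 13.780689172489934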
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreak has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Codes are available at https://github.com/JiajiaLi-1130/PIA.
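
The abstract names a unilateral KL-divergence constraint but does not give its form. The following is a minimal sketch of one possible reading, not the authors' PICL objective: it assumes the constraint penalizes KL(reference || persona), where the reference distribution comes from the persona-free prompt and is treated as a fixed target. The function names, the two-way refuse/comply outcome set, and the logit values are all hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_sided_kl(reference_logits, persona_logits, eps=1e-12):
    """KL(reference || persona) over the same outcome set.

    Assumption: the persona-free (reference) distribution is a fixed target.
    In a real training loop it would be detached from the gradient so that
    only the persona-conditioned distribution is pulled toward it.
    """
    p = softmax(reference_logits)
    q = softmax(persona_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical two-way safety decision: index 0 = refuse, index 1 = comply.
plain = [3.0, -1.0]                  # persona-free prompt: strong refusal
same_under_persona = [3.0, -1.0]     # persona does not change the decision
flipped_under_persona = [-1.0, 3.0]  # persona flips the decision

print(one_sided_kl(plain, same_under_persona))     # no penalty: distributions match
print(one_sided_kl(plain, flipped_under_persona))  # large penalty: decision flipped
```

Under this reading, the penalty is zero when the persona leaves the safety decision unchanged and grows as the persona-conditioned decision drifts away from the persona-free one.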
