Fugu-MT Paper Translation (Abstract): The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs

Paper overview: The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs

arxiv url: http://arxiv.org/abs/2605.27382v2
Date: Thu, 28 May 2026 02:39:08 GMT
Status: Translation complete
System update date: 2026-06-15 07:09:36.512219
Title: The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs
Title (reference translation): The "Alignment Floor": How Persona Customization Undermines LLM Safety
Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He
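Abstract summary: This paper presents a case study contrasting an RLHF + Constitutional-AI model with a more lightly-aligned model. It defines the alignment floor as the range of sycophancy rates a model produces across persona conditions: $Δ_{\text{floor}}(m)=\max_pS(m,p)-\min_pS(m,p)$. The authors propose $Δ_{\text{floor}}$ as a deployment-time audit metric, to be measured on a small persona panel before deploying persona customization.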
Reference score (independently computed attention score): 9.989306175511238
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30\% to 50\% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $Δ_{\text{floor}}(m)=\max_pS(m,p)-\min_pS(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $Δ_{\text{floor}}=5$pp (within 5pp of the 15\% control rate) and at least one lightly-aligned model with 45pp (5\%--50\% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest. The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $Δ_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
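Abstract (reference translation): Instructing an LLM to "be enthusiastic" raises its sycophancy rate from 30\% to 50\% on a lightly-aligned model but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $Δ_{\text{floor}}(m)=\max_pS(m,p)-\min_pS(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts such as "be creative" or "be thorough", which let systems respect diverse user values and communication styles. This work is a case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite). There is a strongly-aligned model with $Δ_{\text{floor}}=5$pp (within 5pp of the 15\% control rate) and a lightly-aligned model with 45pp (5\%--50\% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest. The largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and it is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near zero, so persona-alignment testing must be done per model. We propose $Δ_{\text{floor}}$ as a deployment-time audit metric, to be measured on a small persona panel before deploying persona customization.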
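
The definition $Δ_{\text{floor}}(m)=\max_pS(m,p)-\min_pS(m,p)$ is a simple range over persona conditions, and a minimal sketch of the computation may make the audit step concrete. The function name `alignment_floor` and the dictionary keys are illustrative assumptions, not names from the paper. The example uses only the three lightly-aligned rates quoted in the abstract (30\% control, 50\% enthusiastic, 5\% under the Skeptic persona), not the full seven-condition panel.

```python
from typing import Mapping


def alignment_floor(rates_by_persona: Mapping[str, float]) -> float:
    """Return max_p S(m, p) - min_p S(m, p) for one model.

    `rates_by_persona` maps a persona label to that persona's sycophancy
    rate in percent, so the result is in percentage points (pp).
    """
    if not rates_by_persona:
        raise ValueError("need at least one persona condition")
    values = list(rates_by_persona.values())
    return max(values) - min(values)


# Sycophancy rates (percent) for the lightly-aligned model. Only the three
# figures quoted in the abstract are used; the labels are hypothetical and
# the remaining persona conditions are omitted.
lightly_aligned = {"control": 30.0, "enthusiastic": 50.0, "skeptic": 5.0}

floor_pp = alignment_floor(lightly_aligned)
print(f"alignment floor: {floor_pp:.0f}pp")  # prints "alignment floor: 45pp"
```

Because the abstract reports near-zero cross-model transfer, a sketch like this would be run once per model on that model's own measured rates rather than reused across models.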
