Fugu-MT paper translation (abstract): Emergent alignment and the projectability of ethical personas

Paper overview: Emergent alignment and the projectability of ethical personas

arxiv url: http://arxiv.org/abs/2606.09475v1
Date: Mon, 08 Jun 2026 13:30:29 GMT
Status: translation complete
Last updated in system: 2026-06-09 14:42:07.092619
Title: Emergent alignment and the projectability of ethical personas
Title (reference translation): Emergent alignment and projectability of ethical personas
Authors: Guillermo Del Pinal, Youngchan Lee, Cameron McNamara, Alejandro Perez Carballo
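Abstract summary: We finetune a helpful-only model on broad and narrow safety tasks. We show that finetuning on two narrow safety sub-categories reliably induces emergent alignment. We conclude that alignment strategies should be evaluated not just on their general safety performance, but also on their degree of projectability.
Reference score (independently calculated attention score): 39.3098730337656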
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Work on "emergent misalignment" shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the "persona selection" (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, "emergent alignment", and uses it to support and refine the PSM and to motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the "Constitutional AI" (CAI) approach and use four constitutions which encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that finetuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that we directly filtered out of the datasets used for narrow alignment. To test the PSM using a more fine-grained evaluation, we used a multidimensional "ethical persona" diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well its behavior matches its expected signature profile. Our results show that our CAI models acquire their expected "ethical persona"; for example, the model narrowly finetuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than with deontological beliefs. Yet our coarse- and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.
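Abstract (reference translation): Research on "emergent misalignment" shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the "persona selection" (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which are then elicited and refined during post-training. This paper examines the converse phenomenon, "emergent alignment", and uses it to support and refine the PSM and to motivate a new desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the "Constitutional AI" (CAI) approach and use four constitutions that encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of these models, we show that finetuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that were directly filtered out of the datasets used for narrow alignment. To test the PSM with a more fine-grained evaluation, we used a multidimensional "ethical persona" diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well its behavior matches its expected signature profile. The results show that our CAI models acquire their expected "ethical persona"; for example, the model narrowly finetuned on SFT samples created with the consequentialist constitution agrees significantly more with utilitarian than with deontological beliefs. However, the coarse- and fine-grained evaluations show significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated not only on their general safety performance, but also specifically on their degree of projectability.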


Related papers