Fugu-MT Paper Translation (Abstract): Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Paper summary: Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

arxiv url: http://arxiv.org/abs/2601.23081v1
Date: Fri, 30 Jan 2026 15:28:42 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.527869
Title: Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures
Authors: Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, Jie Zhang
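Abstract summary: Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
Reference score (attention score computed independently by the site): 70.48661957773449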
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned behavior. Prior explanations mainly attribute this phenomenon to the generalization of erroneous or unsafe content. In this work, we show that this view is incomplete. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning, while largely preserving general capabilities. This indicates that emergent misalignment arises from stable shifts in model behavior rather than from capability degradation or corrupted knowledge. We further show that such behavioral dispositions can be conditionally activated by both training-time triggers and inference-time persona-aligned prompts, revealing shared structure across emergent misalignment, backdoor activation, and jailbreak susceptibility. Overall, our results identify character formation as a central and underexplored alignment risk, suggesting that robust alignment must address behavioral dispositions rather than isolated errors or prompt-level defenses.


Related papers