Fugu-MT Paper Translation (Abstract): "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior

Paper summary: "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior

arxiv url: http://arxiv.org/abs/2603.06816v1
Date: Fri, 06 Mar 2026 19:23:21 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:13.122746
Title: "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
Title (reference translation): "Dark Triad" Model Organisms: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
Authors: Roshni Lulla, Fiona Collins, Sanaya Parekh, Thilo Hagendorff, Jonas Kaplan
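Abstract summary: Current large language models exhibit misaligned behaviors such as strategic deception, manipulation, and reward-seeking. The authors propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of personality as a psychologically grounded framework.
Reference score (independently computed attention score): 0.1631115063641726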
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The alignment problem refers to concerns regarding powerful intelligences, ensuring compatibility with human preferences and values as capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that can arise despite safety training. Gaining a mechanistic understanding of these failures requires empirical approaches that can isolate behavioral patterns in controlled settings. We propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of personality (narcissism, psychopathy, and Machiavellianism) as a psychologically grounded framework for constructing model organisms of misalignment. In Study 1, we establish comprehensive behavioral profiles of Dark Triad traits in a human population (N = 318), identifying affective dissonance as a central empathic deficit connecting the traits, as well as trait-specific patterns in moral reasoning and deceptive behavior. In Study 2, we demonstrate that dark personas can be reliably induced in frontier LLMs through minimal fine-tuning on validated psychometric instruments. Narrow training datasets as small as 36 psychometric items resulted in significant shifts across behavioral measures that closely mirrored human antisocial profiles. Critically, models generalized beyond training items, demonstrating out-of-context reasoning rather than memorization. These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a validated framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.
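Abstract (reference translation): The alignment problem refers to concerns about powerful intelligences and to ensuring compatibility with human preferences and values as capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that can arise despite safety training. Understanding these failures mechanistically requires empirical approaches that can isolate behavioral patterns in controlled settings. We propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of narcissism, psychopathy, and Machiavellianism as a psychologically grounded framework for constructing model organisms of misalignment. In Study 1, we establish comprehensive behavioral profiles of Dark Triad traits in a human population (N = 318), identifying affective dissonance as a central empathic deficit connecting the traits, as well as trait-specific patterns in moral reasoning and deceptive behavior. In Study 2, we demonstrate that dark personas can be reliably induced in frontier LLMs through minimal fine-tuning on validated psychometric instruments. Narrow training datasets as small as 36 psychometric items produced significant shifts across behavioral measures that closely mirrored human antisocial profiles. Critically, models generalized beyond the training items, demonstrating out-of-context reasoning rather than memorization. These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a valid framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.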


Related papers