Fugu-MT Paper Translation (Abstract): Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs

Paper overview: Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs

arxiv url: http://arxiv.org/abs/2603.22295v1
Date: Sun, 15 Mar 2026 15:11:45 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.030106
Title: Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs
Title (reference translation): Mechanistic interpretability reveals dissociable affect reception and emotion categorization in LLMs
Authors: Michael Keeman
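Abstract summary: We present the first clinical validity test of emotion circuit claims, using mechanistic interpretability methods grounded in clinical psychology. We discover two dissociable emotion processing mechanisms. We introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models.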
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models appear to develop internal representations of emotion -- "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology -- clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods -- linear probing, causal activation patching, knockout experiments, and representational geometry -- and discover two dissociable emotion processing mechanisms. Affect reception -- detecting emotionally significant content -- operates with near-perfect accuracy (AUROC 1.000), consistent with early-layer saturation, and replicates across all six models. Emotion categorization -- mapping affect to specific emotion labels -- is partially keyword-dependent, dropping 1-7% without keywords and improving with scale. Causal activation patching confirms keyword-rich and keyword-free stimuli share representational space, transferring affective salience rather than emotion-category identity. These findings falsify the keyword-spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models -- with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.
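Abstract (reference translation): Large language models appear to develop internal representations of emotion: "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. However, every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question open: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology, based on clinical vignettes that evoke emotions through situational and behavioural cues alone, with emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods (linear probing, causal activation patching, knockout experiments, and representational geometry) and discover two dissociable emotion processing mechanisms. Affect reception, which detects emotionally significant content, operates with near-perfect accuracy (AUROC 1.000), is consistent with early-layer saturation, and replicates across all six models. Emotion categorization, which maps affect to specific emotion labels, is partially keyword-dependent, dropping 1-7% without keywords and improving with scale. Causal activation patching confirms that keyword-rich and keyword-free stimuli share representational space, transferring affective salience rather than emotion-category identity. These findings falsify the keyword-spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models, with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.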
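
Illustrative sketch (not from the paper): the abstract reports linear-probe results as AUROC. The snippet below shows, under stated assumptions, how a probe score can be turned into an AUROC. The data are synthetic random vectors standing in for hidden states, the probe is a simple mean-difference direction rather than whatever probe the author trained, and every function name is hypothetical.

```python
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference_probe(X, y):
    """Return w = mean(positives) - mean(negatives): a very simple linear probe."""
    dim = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]

    def mean(rows):
        return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]

    return [a - b for a, b in zip(mean(pos), mean(neg))]


def project(X, w):
    """Score each vector by its dot product with the probe direction."""
    return [sum(a * b for a, b in zip(x, w)) for x in X]


def make_synthetic(n, dim, shift, rng):
    """Assumed stand-in for hidden states: positives are shifted along axis 0."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * label
        X.append(x)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(200, 8, 3.0, rng)
X_test, y_test = make_synthetic(200, 8, 3.0, rng)

w = fit_mean_difference_probe(X_train, y_train)
test_auroc = auroc(project(X_test, w), y_test)
print(f"held-out AUROC: {test_auroc:.3f}")
```

An AUROC of 1.000, as quoted for affect reception, would mean every positive example scores above every negative one on held-out data. In a real replication the synthetic vectors would be replaced by activations extracted from a chosen model layer.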

Related papers