Fugu-MT Paper Translation (Abstract): Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?

Paper overview: Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?

arxiv url: http://arxiv.org/abs/2604.11802v1
Date: Mon, 13 Apr 2026 17:58:49 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.745771
Title: Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?
Authors: Yuto Harada, Hiro Taiyo Hamada
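Abstract summary: Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user's personality. This study focuses on questionnaire-operationalized Big Five concepts, analyzes the formation and localization of their internal representations, and examines how these representations relate to behavioral outputs.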
Reference score (independently computed attention score): 0.7265327330178978
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Using psychological constructs such as the Big Five, large language models (LLMs) can imitate specific personality profiles and predict a user's personality. While LLMs can exhibit behaviors consistent with these constructs, it remains unclear where and how they are represented inside the model and how they relate to behavioral outputs. To address this gap, we focus on questionnaire-operationalized Big Five concepts, analyze the formation and localization of their internal representations, and use interventions to examine how these representations relate to behavioral outputs. In our experiment, we first use probing to examine where Big Five information emerges across model depth. We then identify neurons that respond selectively to each Big Five concept and test whether enhancing or suppressing their activations can bias latent representations and label generation in intended directions. We find that Big Five information becomes rapidly decodable in early layers and remains detectable through the final layers, while concept-selective neurons are most prevalent in mid layers and exhibit limited overlap across domains. Interventions on these neurons consistently shift probe readouts toward targeted concepts, with targeted success rates exceeding 0.8 for some concepts, indicating that the model's internal separation of Big Five personality traits can be causally steered. At the label-generation level, the same interventions often bias generated label distributions in the intended directions, but the effects are weaker, more concept-dependent, and often accompanied by cross-trait spillover, indicating that comparable control over generated labels is difficult even with interventions on a large fraction of concept-selective neurons. Overall, our findings reveal a gap between representational control and behavioral control in LLMs.
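The pipeline the abstract describes (probe, find concept-selective neurons, intervene, re-read the probe) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' method: the activations are synthetic Gaussian data rather than LLM hidden states, the probe is a least-squares linear map onto one-hot labels, selectivity is a standardized mean difference, and the intervention simply adds a constant to the selected neurons. All function names and parameters are hypothetical.

```python
import numpy as np

TRAITS = ["O", "C", "E", "A", "N"]  # Big Five labels

def make_activations(rng, n_per_class=200, dim=64, k=4, signal=3.0):
    """Synthetic stand-in for layer activations (assumption: each trait
    elevates its own block of k neurons; real models are not this clean)."""
    X, y = [], []
    for c in range(len(TRAITS)):
        acts = rng.normal(0.0, 1.0, size=(n_per_class, dim))
        acts[:, c * k:(c + 1) * k] += signal
        X.append(acts)
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)

def add_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

def fit_linear_probe(X, y, n_classes):
    """Least-squares linear probe onto one-hot labels (one simple choice)."""
    W, *_ = np.linalg.lstsq(add_bias(X), np.eye(n_classes)[y], rcond=None)
    return W

def probe_predict(W, X):
    return (add_bias(X) @ W).argmax(axis=1)

def selective_neurons(X, y, concept, top_k=4):
    """Rank neurons by standardized mean difference, concept vs. rest."""
    inside, outside = X[y == concept], X[y != concept]
    score = (inside.mean(axis=0) - outside.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    return np.argsort(score)[-top_k:]

def intervene(X, neurons, delta):
    """Enhance (delta > 0) or suppress (delta < 0) the chosen neurons."""
    X2 = X.copy()
    X2[:, neurons] += delta
    return X2

rng = np.random.default_rng(0)
X_train, y_train = make_activations(rng)
X_test, y_test = make_activations(rng)

W = fit_linear_probe(X_train, y_train, len(TRAITS))
accuracy = (probe_predict(W, X_test) == y_test).mean()

target = 2  # steer probe readouts toward "E"
neurons = selective_neurons(X_train, y_train, target)
others = X_test[y_test != target]  # inputs that do not express the target
before = (probe_predict(W, others) == target).mean()
after = (probe_predict(W, intervene(others, neurons, delta=6.0)) == target).mean()

print(f"probe accuracy on held-out data: {accuracy:.3f}")
print(f"targeted success rate before/after: {before:.3f} / {after:.3f}")
```

Here `after` plays the role of a targeted success rate at the probe level only. The sketch has no generation step, so it cannot show the gap between representational and behavioral control that the abstract reports.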

Related papers