Fugu-MT Paper Translation (Abstract): When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Paper summary: When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

arxiv url: http://arxiv.org/abs/2602.11358v2
Date: Wed, 18 Feb 2026 12:06:38 GMT
Status: Translation complete
Last updated in system: 2026-02-19 13:51:30.946594
Title: When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
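Title (reference translation): Examining Vocabulary-Activation Correspondence Models in Self-Referential Processing
Authors: Zachary Pedram Dadfar
Abstract summary: We show that self-referential vocabulary tracks concurrent activation dynamics. We identify a direction in activation space that distinguishes self-referential from descriptive processing. The findings suggest that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
Reference score (independently computed attention metric): 0.0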
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
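Abstract (reference translation): Large language models generate rich introspective language when prompted to examine themselves, but it is still unclear whether this language reflects internal computation or sophisticated confabulation. We show that self-referential vocabulary tracks concurrent activation dynamics and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that draws out extended self-examination through format engineering, and use it to identify a direction in activation space that separates self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, is localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models generate "loop" vocabulary, their activations show higher autocorrelation (r = 0.44, p = 0.002); when they generate "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence, even though it occurs nine times more often. Qwen 2.5-32B, which shares no training, independently develops different introspective vocabulary that tracks different activation metrics, all of which are absent in descriptive controls. The results suggest that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.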
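
The abstract does not say how the direction or the activation metrics are computed. As a rough illustration only, the sketch below assumes a difference-of-means direction (a common choice in steering work, not confirmed as this paper's method), cosine similarity as the orthogonality check against a second direction, and lag-1 autocorrelation of a per-token projection as a stand-in for the "loop" metric. Every function name and all data here are hypothetical; the data is synthetic noise, not model activations.

```python
import numpy as np


def mean_difference_direction(self_ref, descriptive):
    """Unit vector pointing from the descriptive mean to the self-referential mean.

    Both inputs are (n_samples, hidden_dim) arrays of activations.
    ASSUMPTION: difference-of-means is an illustrative choice, not the paper's stated method.
    """
    diff = self_ref.mean(axis=0) - descriptive.mean(axis=0)
    return diff / np.linalg.norm(diff)


def cosine(u, v):
    """Cosine similarity; values near 0 mean the two directions are close to orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def lag1_autocorrelation(series):
    """Pearson correlation between a series and the same series shifted by one step."""
    x = np.asarray(series, dtype=float)
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    return float(np.corrcoef(a, b)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden_dim = 64

    # Synthetic stand-ins for activations collected under two prompt types.
    self_ref_acts = rng.normal(size=(200, hidden_dim)) + 0.5
    descriptive_acts = rng.normal(size=(200, hidden_dim))
    direction = mean_difference_direction(self_ref_acts, descriptive_acts)

    # Hypothetical second direction (e.g. a refusal direction) to compare against.
    other_direction = rng.normal(size=hidden_dim)
    print("cosine with other direction:", cosine(direction, other_direction))

    # Per-token projections for one generated sequence, then its lag-1 autocorrelation.
    token_acts = rng.normal(size=(50, hidden_dim))
    print("lag-1 autocorrelation:", lag1_autocorrelation(token_acts @ direction))

    # Across many sequences one would correlate a vocabulary count with the metric;
    # both arrays here are random placeholders.
    vocab_counts = rng.integers(0, 10, size=40)
    metric_values = rng.normal(size=40)
    print("vocabulary/metric r:", pearson_r(vocab_counts, metric_values))
```

With real data, the two activation matrices would come from a chosen layer of the model under self-referential and descriptive prompts, and the final correlation would be computed over generated sequences rather than random placeholders.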

Related papers