Fugu-MT Paper Translation (Abstract): Finding Belief Geometries with Sparse Autoencoders

Paper summary: Finding Belief Geometries with Sparse Autoencoders

arxiv url: http://arxiv.org/abs/2604.02685v1
Date: Fri, 03 Apr 2026 03:29:48 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.301348
Title: Finding Belief Geometries with Sparse Autoencoders
Title (reference translation): Finding Belief Geometries with Sparse Autoencoders
Authors: Matthew Levinson
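Abstract summary: This paper proposes a pipeline for finding candidate simplex-structured subspaces in transformer representations. The pipeline is validated on a transformer trained on a hidden Markov model. The authors report preliminary evidence that genuine belief-like geometry exists in the representation space of Gemma-2-9B.
Reference score (independently computed attention measure): 0.0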
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric prediction as our primary discriminating test. Among the 13 priority clusters, 3 exhibit a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$) and 4 on simplex-interior samples. Together 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset. This is the only case where passive prediction and active intervention converge. We present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space, and identify the structured evaluation that would be required to confirm this interpretation.
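Abstract (reference translation): Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in the residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. This paper proposes a pipeline for finding candidate simplex-structured subspaces in transformer representations by combining sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting using AANet. The pipeline is validated on a transformer trained on a multipartite hidden Markov model. Applied to Gemma-2-9B, it identifies 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. The authors therefore adopt barycentric prediction as the primary discriminating test. Among the 13 priority clusters, 3 show a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$) and 4 on simplex-interior samples. Together, 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset; it is the only case where passive prediction and active intervention converge. The authors present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space, and identify the structured evaluation that would be required to confirm this interpretation.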
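
The abstract relies on barycentric (mixture) coordinates of a point relative to the vertices of a fitted simplex. As a rough illustration only, the sketch below recovers such coordinates with an ordinary least-squares solve in NumPy. It does not reproduce AANet or the paper's barycentric prediction test; the function names `barycentric_coordinates` and `inside_simplex` and the toy triangle are assumptions made for this example.

```python
import numpy as np


def barycentric_coordinates(point, vertices):
    """Return weights w such that w @ vertices is close to point and sum(w) is close to 1.

    vertices has shape (K, d): K simplex vertices in d dimensions.
    Hypothetical helper for illustration; not taken from the paper.
    """
    vertices = np.asarray(vertices, dtype=float)
    point = np.asarray(point, dtype=float)
    # Append the affine constraint sum(w) = 1 as one extra row of the system.
    A = np.vstack([vertices.T, np.ones(len(vertices))])
    b = np.append(point, 1.0)
    # Exact when the point lies in the affine hull of the vertices;
    # otherwise a least-squares fit in which the constraint is only soft.
    weights, *_ = np.linalg.lstsq(A, b, rcond=None)
    return weights


def inside_simplex(weights, tol=1e-9):
    """A point is inside the simplex when no barycentric weight is negative."""
    return bool(np.all(np.asarray(weights) >= -tol))


# Toy 2-simplex (triangle) in the plane, chosen only for this example.
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
w_inside = barycentric_coordinates([0.25, 0.25], triangle)
w_outside = barycentric_coordinates([1.0, 1.0], triangle)
print(w_inside, inside_simplex(w_inside))
print(w_outside, inside_simplex(w_outside))
```

In this picture, a weight near 1 on one vertex corresponds to what the abstract calls a near-vertex sample, while several comparable positive weights correspond to a simplex-interior sample.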

Related papers