Fugu-MT paper translation (abstract): Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Paper overview: Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

arxiv url: http://arxiv.org/abs/2606.06333v1
Date: Thu, 04 Jun 2026 16:08:25 GMT
Status: translation complete
Last updated in system: 2026-06-05 22:39:44.934496
Title: Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability
Authors: Seyed Arshan Dalili, Mehrdad Mahdavi
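Abstract summary: Sparse autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, but their formulation assigns each latent feature a single decoder direction, implicitly assuming that features are one-dimensional. The paper shows that this assumption does not match the multi-dimensional structure of model features. It introduces Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces.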
Reference score (independently computed attention metric): 11.543771846135021
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group's effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.
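The three SASA ingredients named in the abstract (decoder subspaces, Top-$s$ group gating, and a nuclear-norm regularizer) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear encoder, the use of the per-group L2 norm as the gating score, the placement of the nuclear norm on each group's decoder block, and all names (`sasa_forward`, `nuclear_norm_penalty`, `sasa_loss`, `lam`) are assumptions made for illustration only.

```python
import numpy as np

def sasa_forward(x, W_enc, b_enc, D, s):
    """Hypothetical SASA-style forward pass (illustrative assumptions throughout).

    x:     (d,)       input activation
    W_enc: (G*r, d)   linear encoder weights (assumed form)
    b_enc: (G*r,)     encoder bias
    D:     (G, r, d)  one r-dimensional decoder subspace per group
    s:     number of groups kept active (Top-s group gating)
    """
    G, r, d = D.shape
    z = (W_enc @ x + b_enc).reshape(G, r)    # r codes per group
    scores = np.linalg.norm(z, axis=1)       # assumed gating score: group L2 norm
    keep = np.argsort(scores)[-s:]           # indices of the s highest-scoring groups
    mask = np.zeros(G, dtype=bool)
    mask[keep] = True
    z_sparse = np.where(mask[:, None], z, 0.0)   # block sparsity: whole groups on or off
    x_hat = np.einsum("gr,grd->d", z_sparse, D)  # sum of the active subspace reconstructions
    return x_hat, z_sparse, mask

def nuclear_norm_penalty(D):
    """Sum of singular values of every group's (r, d) decoder block.

    Assumption: the regularizer acts on decoder blocks; the abstract does not say.
    """
    return float(np.linalg.svd(D, compute_uv=False).sum())

def sasa_loss(x, W_enc, b_enc, D, s, lam):
    """Squared reconstruction error plus lam times the nuclear-norm penalty."""
    x_hat, _, _ = sasa_forward(x, W_enc, b_enc, D, s)
    return float(np.sum((x - x_hat) ** 2)) + lam * nuclear_norm_penalty(D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, G, r, s = 8, 4, 2, 2
    x = rng.normal(size=d)
    W_enc = rng.normal(size=(G * r, d))
    b_enc = np.zeros(G * r)
    D = rng.normal(size=(G, r, d))
    x_hat, z_sparse, mask = sasa_forward(x, W_enc, b_enc, D, s)
    print("active groups:", np.flatnonzero(mask))
    print("loss:", sasa_loss(x, W_enc, b_enc, D, s, lam=1e-3))
```

In this sketch a standard SAE corresponds to `r = 1`, where each group collapses to a single decoder direction; choosing `r` at least as large as a feature's intrinsic dimension is what lets one group hold the whole feature rather than spreading it across many latents.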

Related papers