Fugu-MT Paper Translation (Abstract): MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

Paper overview: MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

arxiv url: http://arxiv.org/abs/2604.03436v1
Date: Fri, 03 Apr 2026 20:20:28 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:18.577302
Title: MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
Authors: Matthew Levinson
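Abstract summary: Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. In practice, SAE latents blend representational subspaces together. A single feature can activate across semantically distinct contexts that share no true common representation. We introduce a joint training objective that directly penalizes this subspace blending.
Reference score (independently computed attention level): 0.0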
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. These use cases require SAE latents to be as atomic as possible. Each latent should represent a single coherent concept drawn from a single underlying representational subspace. In practice, SAE latents blend representational subspaces together. A single feature can activate across semantically distinct contexts that share no true common representation, muddying an already complex picture of model computation. We introduce a joint training objective that directly penalizes this subspace blending. A small meta SAE is trained alongside the primary SAE to sparsely reconstruct the primary SAE's decoder columns; the primary SAE is penalized whenever its decoder directions are easy to reconstruct from the meta dictionary. This occurs whenever latent directions lie in a subspace spanned by other primary directions. This creates gradient pressure toward more mutually independent decoder directions that resist sparse meta-compression. On GPT-2 Large (layer 20), the selected configuration reduces mean $|\varphi|$ by 7.5% relative to an identical solo SAE trained on the same data. Automated interpretability (fuzzing) scores improve by 7.6%, providing external validation of the atomicity gain independent of the training and co-occurrence metrics. Reconstruction overhead is modest. Results on Gemma 2 9B are directional. On not-fully-converged SAEs, the same parameterization yields the best results, a $+8.6\%$ $\Delta$Fuzz. Though directional, this is an encouraging sign that the method transfers to a larger model. Qualitative analysis confirms that features firing on polysemantic tokens are split into semantically distinct sub-features, each specializing in a distinct representational subspace.
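The joint objective described in the abstract can be illustrated with a small sketch. The abstract does not give the exact form of the penalty, so everything below is an assumption made for illustration: the meta SAE is modeled as a top-k ReLU encoder with a linear decoder, the penalty is taken to be the mean cosine similarity between each primary decoder direction and its meta reconstruction, and the names `meta_encode`, `meta_decode`, `decomposability_penalty`, `primary_loss`, `k`, and `lam` are hypothetical. The code only evaluates the loss terms in plain Python; an actual training run would presumably use an autodiff framework so that the penalty produces gradients on the primary decoder.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def meta_encode(direction, meta_enc, k):
    """Assumed meta encoder: ReLU pre-activations, keeping only the k largest."""
    acts = [max(0.0, dot(row, direction)) for row in meta_enc]
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def meta_decode(codes, meta_dec):
    """Linear decoder: weighted sum of meta dictionary atoms (one atom per meta latent)."""
    dim = len(meta_dec[0])
    return [sum(c * atom[j] for c, atom in zip(codes, meta_dec)) for j in range(dim)]

def decomposability_penalty(directions, meta_enc, meta_dec, k):
    """Assumed penalty: mean cosine similarity between each primary decoder
    direction and its sparse meta reconstruction (higher = easier to reconstruct)."""
    sims = []
    for d in directions:
        d_hat = meta_decode(meta_encode(d, meta_enc, k), meta_dec)
        denom = norm(d) * norm(d_hat)
        sims.append(dot(d, d_hat) / denom if denom > 0 else 0.0)
    return sum(sims) / len(sims)

def meta_loss(directions, meta_enc, meta_dec, k):
    """Objective for the meta SAE itself: mean squared error on the decoder directions."""
    total = 0.0
    for d in directions:
        d_hat = meta_decode(meta_encode(d, meta_enc, k), meta_dec)
        total += sum((a - b) ** 2 for a, b in zip(d, d_hat)) / len(d)
    return total / len(directions)

def primary_loss(recon_mse, directions, meta_enc, meta_dec, k=1, lam=0.1):
    """Primary SAE objective: its own reconstruction error plus the weighted penalty."""
    return recon_mse + lam * decomposability_penalty(directions, meta_enc, meta_dec, k)

# Toy usage: a 2-atom meta dictionary and two primary decoder directions.
identity = [[1.0, 0.0], [0.0, 1.0]]
s = 1.0 / math.sqrt(2.0)
directions = [[1.0, 0.0], [s, s]]
print(decomposability_penalty(directions, identity, identity, k=1))
print(primary_loss(0.5, directions, identity, identity, k=1, lam=0.1))
```

In this sketch the two networks pull in opposite directions: the meta SAE lowers `meta_loss` by reconstructing the decoder directions well, while the primary SAE lowers `primary_loss` by making those directions harder to reconstruct under the sparsity budget `k`.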

Related papers