Fugu-MT paper translation (abstract): Rational Sparse Autoencoder

Paper summary: Rational Sparse Autoencoder

arxiv url: http://arxiv.org/abs/2606.14990v2
Date: Tue, 16 Jun 2026 02:02:24 GMT
Status: translation complete
System update date: 2026-06-17 15:01:46.723174
Title: Rational Sparse Autoencoder
Title (reference translation): Rational Sparse Autoencoder
Authors: Naiyu Yin, Yue Yu
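Abstract summary: This paper introduces the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. After fine-tuning, the RSAE strictly improves on the baseline on both reconstruction-side metrics and downstream-behaviour metrics. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity tested.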
Reference score (independently computed attention score): 14.27315714880774
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after a separating top-k threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE strictly improves on it after the fine-tuning step, both on reconstruction-side metrics and on downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.
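Abstract (reference translation): Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. This paper introduces the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate, on compact domains, the activation primitives used by existing SAE families (for TopK, the thresholded gate obtained once a separating top-k threshold is supplied), and they also provide a richer function class for adapting to the observed pre-activation geometry. The idea is realised through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters together with the rational coefficients, followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE strictly improves on the baseline after fine-tuning, on both reconstruction-side and downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, baseline activation families, and the full range of baseline sparsity tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.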
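
To make the core idea concrete, here is a minimal NumPy sketch of an SAE forward pass whose encoder nonlinearity is a rational function with learnable coefficients. The abstract does not give the paper's parameterisation, so everything specific here is an assumption for illustration: the pole-free form P(x) / (1 + |Q(x)|), the L1 sparsity penalty, and the hypothetical names `rational_activation`, `encode`, `decode`, and `sae_loss`. It does not implement the Remez-based initialisation or the calibration step.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly


def rational_activation(x, num, den):
    """Evaluate P(x) / (1 + |Q(x)|) elementwise.

    num: coefficients a_0..a_m of P, lowest degree first.
    den: coefficients b_1..b_n of Q, which has no constant term.
    The 1 + |Q| denominator is an assumed pole-free form, not taken
    from the paper.
    """
    x = np.asarray(x, dtype=float)
    p = npoly.polyval(x, np.asarray(num, dtype=float))
    q = npoly.polyval(x, np.concatenate(([0.0], np.asarray(den, dtype=float))))
    return p / (1.0 + np.abs(q))


def encode(x, W_enc, b_enc, num, den):
    """Encoder with the fixed nonlinearity replaced by the rational one."""
    return rational_activation(x @ W_enc + b_enc, num, den)


def decode(z, W_dec, b_dec):
    """Linear decoder back to the model's activation space."""
    return z @ W_dec + b_dec


def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    """Reconstruction error plus an assumed L1 sparsity penalty."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + l1_coeff * sparsity


rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # toy sizes, chosen only for the example
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

# Arbitrary example coefficients; in training these would be learned.
num = np.array([0.0, 0.5, 0.5])
den = np.array([0.0, 0.1])

x = rng.normal(size=(4, d_model))
z = encode(x, W_enc, b_enc, num, den)
x_hat = decode(z, W_dec, b_dec)
loss = sae_loss(x, x_hat, z)
```

In this sketch the only parameters added on top of a standard SAE are the entries of `num` and `den`, which is in line with the abstract's remark that the upgrade adds a handful of scalar parameters per autoencoder.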

Related papers