Fugu-MT Paper Translation (Abstract): Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Paper overview: Binary Autoencoder for Mechanistic Interpretability of Large Language Models

arxiv url: http://arxiv.org/abs/2509.20997v1
Date: Thu, 25 Sep 2025 10:48:48 GMT
Status: Translation complete
System update date: 2025-09-26 20:58:12.849215
Title: Binary Autoencoder for Mechanistic Interpretability of Large Language Models
Title (reference translation): Binary Autoencoder for Mechanistic Interpretability of Large Language Models
Authors: Hakaze Cho, Haolin Yang, Brian M. Kurkoski, Naoya Inoue
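Abstract summary: The paper proposes a novel binary autoencoder that enforces minimal entropy on minibatches of hidden activations. For efficient entropy calculation, the hidden activations are discretized to 1 bit via a step function. The entropy estimated on these binary activations is empirically evaluated and used to characterize the inference dynamics of large language models.
Reference score (independently computed attention score): 8.725176890854065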
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing works aim to untangle atomized numerical components (features) from the hidden states of Large Language Models (LLMs) in order to interpret their mechanisms. However, they typically rely on autoencoders constrained by implicit training-time regularization applied to single training instances (e.g., $L_1$ normalization or a top-k function), without an explicit guarantee of global sparsity across instances. This causes a large number of dense (simultaneously inactive) features, harming feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1 bit via a step function and apply gradient estimation to enable backpropagation; we therefore term it Binary Autoencoder (BAE). We empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we empirically evaluate and leverage to characterize the inference dynamics of LLMs and In-context Learning. (2) Feature untangling. Like typical methods, BAE can extract atomized features from an LLM's hidden states. To robustly evaluate this feature extraction capability, we refine traditional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while producing the largest number of interpretable ones among baselines, which confirms the effectiveness of BAE as a feature extractor.
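Abstract (reference translation): Existing work seeks to disentangle atomized numerical components (features) from the hidden states of large language models (LLMs) in order to interpret how they work. These approaches, however, usually depend on autoencoders constrained by implicit training-time regularization on single training instances (e.g., $L_1$ normalization or a top-k function) and give no explicit guarantee of global sparsity across instances, which produces many dense (simultaneously inactive) features and damages feature sparsity and atomization. This paper proposes a new autoencoder variant that enforces minimal entropy on minibatches of hidden activations, promoting feature independence and sparsity across instances. To compute entropy efficiently, the hidden activations are discretized to 1 bit with a step function, and gradient estimation is applied to make backpropagation possible. The authors call this the Binary Autoencoder (BAE) and empirically demonstrate two main applications. (1) Feature set entropy calculation: entropy can be estimated reliably on binary hidden activations, and this estimate is evaluated experimentally and used to characterize the inference dynamics of LLMs and in-context learning. (2) Feature untangling: like typical methods, BAE can extract atomized features from an LLM's hidden states. To evaluate this extraction capability robustly, the authors refine conventional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while yielding the largest number of interpretable features among the baselines, confirming its effectiveness as a feature extractor.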
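
The binarize-then-measure-entropy idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the threshold of 0, and the choice of summing per-feature Bernoulli entropies are assumptions made for illustration (the abstract does not state the exact entropy estimator), and the gradient-estimation step used for training is omitted.

```python
import math

def binarize(activations, threshold=0.0):
    """Step function: map each real-valued activation to a 0/1 bit.
    The threshold of 0.0 is an assumption for this sketch."""
    return [[1 if a > threshold else 0 for a in row] for row in activations]

def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; taken as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def minibatch_feature_entropy(bits):
    """Per-feature marginal entropy (in bits), estimated from firing rates
    over the minibatch. Summing these equals the joint entropy only if the
    features are independent; otherwise it is an upper bound."""
    n = len(bits)
    n_features = len(bits[0])
    rates = [sum(row[j] for row in bits) / n for j in range(n_features)]
    return [bernoulli_entropy(p) for p in rates]

# Hypothetical minibatch: 4 instances, 3 hidden features.
batch = [
    [0.9, -0.2, 0.1],
    [0.4, -0.7, -0.3],
    [1.3, -0.1, 0.8],
    [0.2, -0.5, -0.6],
]
bits = binarize(batch)
entropies = minibatch_feature_entropy(bits)
total = sum(entropies)
print(bits)
print(entropies, total)
```

In this toy batch the first feature fires on every instance and the second never fires, so both carry zero entropy, while the third fires on half of the instances and carries one bit. A feature that is always on or always off contributes nothing to an entropy objective computed over the minibatch, which is the kind of cross-instance signal that per-instance regularizers do not see.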

Related papers