Fugu-MT Paper Translation (Abstract): Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

Paper abstract: Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

arxiv url: http://arxiv.org/abs/2510.03659v1
Date: Sat, 04 Oct 2025 04:14:50 GMT
Status: Translation complete
System update date: 2025-10-07 16:52:59.179987
Title: Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders
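Title (reference translation): Is higher interpretability useful? A pairwise analysis of sparse autoencoders
Authors: Xu Wang, Yan Hu, Benyou Wang, Difan Zou
Abstract summary: We train 90 SAEs across three language models and evaluate their interpretability and steering utility. The analysis shows only a relatively weak positive association (tau b approx 0.298), suggesting that interpretability is an insufficient indicator of steering performance. We propose a new selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next-token distribution.
Reference score (independently computed attention score): 63.544453925182005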
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective model behavior steering. Yet, a fundamental question remains unanswered: does higher interpretability indeed imply better steering utility? To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels, and evaluate their interpretability and steering utility based on SAEBench (arXiv:2501.12345) and AxBench (arXiv:2502.23456) respectively, and perform a rank-agreement analysis via Kendall's rank coefficients (tau b). Our analysis reveals only a relatively weak positive association (tau b approx 0.298), indicating that interpretability is an insufficient proxy for steering performance. We conjecture the interpretability utility gap may stem from the selection of SAE features, as not all of them are equally effective for steering. To further find features that truly steer the behavior of LLMs, we propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next token distribution. We show that our method improves the steering performance of three LLMs by 52.52 percent compared to the current best output score based criterion (arXiv:2503.34567). Strikingly, after selecting features with high Delta Token Confidence, the correlation between interpretability and utility vanishes (tau b approx 0), and can even become negative. This further highlights the divergence between interpretability and utility for the most effective steering features.
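The rank-agreement analysis mentioned in the abstract relies on Kendall's tau-b, which compares two rankings pair by pair and corrects for ties. The following is a minimal pure-Python sketch of that statistic, not the authors' evaluation code. The function name `kendall_tau_b` and the toy score lists are hypothetical and chosen only for illustration; they are not values from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length score lists (ties allowed)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    # Compare every pair of items once.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither term
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # both lists order the pair the same way
        else:
            discordant += 1  # the lists disagree on the pair
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    # Hypothetical per-SAE scores (made-up toy values, not paper results).
    interpretability = [0.62, 0.70, 0.55, 0.81, 0.66]
    steering_utility = [0.30, 0.28, 0.35, 0.33, 0.25]
    print(kendall_tau_b(interpretability, steering_utility))
```

A value near 1 means the two metrics rank the SAEs almost identically, a value near 0 means the rankings are essentially unrelated, and a negative value means they tend to disagree. If SciPy is available, `scipy.stats.kendalltau` computes the same tau-b variant by default.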
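Abstract (reference translation): Sparse autoencoders (SAEs) are widely used to steer large language models (LLMs). However, a fundamental question remains unresolved: does higher interpretability actually imply better steering utility? To answer this question, we train 90 SAEs on three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels, evaluate their interpretability and steering utility with SAEBench (arXiv:2501.12345) and AxBench (arXiv:2502.23456) respectively, and perform a rank-agreement analysis using Kendall's rank coefficient (tau b). The analysis shows only a relatively weak positive association (tau b approx 0.298), suggesting that interpretability is an insufficient indicator of steering performance. We conjecture that the interpretability-utility gap stems from the selection of SAE features, since not all features are equally effective for steering. To find features that truly steer LLM behavior, we propose a new selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next-token distribution. We show that our method improves the steering performance of three LLMs by 52.52% compared with the current best output-score-based criterion (arXiv:2503.34567). Strikingly, after selecting features with high Delta Token Confidence, the correlation between interpretability and utility vanishes (tau b approx 0). This further highlights the divergence between interpretability and utility for the most effective steering features.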
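The abstract describes Delta Token Confidence only as a measure of how much amplifying a feature changes the next-token distribution and does not give a formula. The sketch below shows one possible reading, under a loudly stated assumption: confidence is taken to be the probability of the most likely next token, and the score is the change in that probability after amplification. The names `delta_token_confidence` and `rank_features`, and the idea of passing precomputed logits, are hypothetical and may differ from the paper's actual definition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def delta_token_confidence(base_logits, steered_logits):
    """ASSUMED definition: change in top-token probability after amplification.

    base_logits: next-token logits from the unmodified model.
    steered_logits: next-token logits after amplifying one SAE feature.
    """
    return max(softmax(steered_logits)) - max(softmax(base_logits))


def rank_features(base_logits, steered_by_feature, top_k):
    """Return the top_k feature ids with the largest assumed score."""
    scores = {
        feature: delta_token_confidence(base_logits, logits)
        for feature, logits in steered_by_feature.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In practice the steered logits would come from a forward pass with the chosen feature's activation scaled up; that step depends on the model and SAE library and is deliberately left out of this sketch.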


Related papers