Fugu-MT Paper Translation (Abstract): Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

Paper abstract: Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

arxiv url: http://arxiv.org/abs/2605.12874v1
Date: Wed, 13 May 2026 01:41:38 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.750716
Title: Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
Title (reference translation): Descriptive Collision in Sparse Autoencoders: When One Explanation Describes Many Features
Authors: Jordan F. McCann
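Abstract summary: We identify a problem we call descriptive collision: many distinct SAE features admit the same explanation. We formalize a property called discrimination and prove that current detection-style auto-interpretability scoring is invariant to collision. We propose two complementary corrective metrics, collision-adjusted detection and discrimination scoring, that explicitly penalize explanations that fail to distinguish a feature from its neighbors.
Reference score (independently computed attention score): 0.0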
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity -- one feature with many meanings -- or on whether explanations predict activations. We identify a complementary, structurally distinct problem we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing the largest publicly-available dataset of human-annotated SAE features (Marks et al., 2025), comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that the mean annotation string is reused across 3.07 features; 82.1% of features share their annotation with at least one other feature; and the single most common annotation string ("plural nouns") labels 101 distinct features spanning 18 layers and four model components. Information-theoretically, the average annotation resolves only 70% of feature identity. We formalize a property called discrimination, prove that current detection-style auto-interpretability scoring is invariant to collision, and propose two complementary corrective metrics -- collision-adjusted detection and discrimination scoring -- that explicitly penalize explanations that fail to distinguish a feature from its neighbors. The collision problem is independent of, and additive with, previously identified failure modes of auto-interpretability; ignoring it inflates reported feature interpretability by a quantity equal to roughly one-third of the bits required to identify a feature.
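Abstract (reference translation): Sparse autoencoders (SAEs) have become standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity (one feature with many meanings) or on whether explanations predict activations. We identify a complementary, structurally distinct problem that we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing a dataset of human-annotated SAE features (Marks et al., 2025) comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that the mean annotation string is reused across 3.07 features. Information-theoretically, the average annotation resolves only 70% of feature identity. We formalize a property called discrimination, prove that current detection-style auto-interpretability scoring is invariant to collision, and propose two complementary corrective metrics (collision-adjusted detection and discrimination scoring) that explicitly penalize explanations that fail to distinguish a feature from its neighbors. The collision problem is independent of, and additive with, previously identified failure modes of auto-interpretability; ignoring it inflates reported feature interpretability by a quantity equal to roughly one-third of the bits required to identify a feature.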
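
The collision statistics quoted in the abstract can be illustrated with a small sketch. The abstract does not give the paper's exact definitions, so everything below is an assumption: `collision_stats` is a hypothetical helper, "mean reuse" is taken to be features per distinct annotation string, and the "fraction of feature identity resolved" is taken to be the entropy of the annotation distribution divided by log2 of the number of features (treating features as equally likely and each feature as having exactly one annotation).

```python
import math
from collections import Counter


def collision_stats(annotations):
    """Summarize annotation reuse for a list with one annotation string per feature.

    Assumed definitions (not taken from the paper):
      mean_reuse        = number of features / number of distinct annotation strings
      shared_fraction   = share of features whose annotation also labels another feature
      fraction_resolved = H(annotation) / log2(number of features)
    """
    n = len(annotations)
    if n == 0:
        raise ValueError("need at least one annotated feature")

    counts = Counter(annotations)

    mean_reuse = n / len(counts)
    shared_fraction = sum(c for c in counts.values() if c > 1) / n

    # Entropy (in bits) of the annotation distribution under a uniform prior over features.
    # Because each feature maps to exactly one annotation, this equals the mutual
    # information between feature identity and annotation.
    entropy_bits = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Bits needed to single out one feature among n equally likely features.
    identity_bits = math.log2(n)
    fraction_resolved = entropy_bits / identity_bits if identity_bits > 0 else 1.0

    return {
        "mean_reuse": mean_reuse,
        "shared_fraction": shared_fraction,
        "fraction_resolved": fraction_resolved,
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    toy = ["plural nouns", "plural nouns", "plural nouns", "commas", "commas", "dates"]
    print(collision_stats(toy))
```

Under these assumed definitions, a label set with no collisions gives a mean reuse of 1 and resolves all of the identity bits, while heavy reuse of a few strings pushes the resolved fraction toward 0.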

Related papers