Fugu-MT Paper Translation (Abstract): Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions

Paper overview: Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions

arxiv url: http://arxiv.org/abs/2508.16950v1
Date: Sat, 23 Aug 2025 08:48:59 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.270554
Title: Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions
Authors: Manan Gupta, Dhruv Kumar
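Abstract summary: The Polysemanticity Index (PSI) is a null-calibrated metric that quantifies when a neuron's top activations decompose into semantically distinct clusters. On a pretrained ResNet-50 evaluated with Tiny-ImageNet images, PSI identifies neurons whose activation sets split into coherent, nameable prototypes.
Reference score (attention score computed independently by this site): 4.032680910442999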
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Neural networks often contain polysemantic neurons that respond to multiple, sometimes unrelated, features, complicating mechanistic interpretability. We introduce the Polysemanticity Index (PSI), a null-calibrated metric that quantifies when a neuron's top activations decompose into semantically distinct clusters. PSI multiplies three independently calibrated components: geometric cluster quality (S), alignment to labeled categories (Q), and open-vocabulary semantic distinctness via CLIP (D). On a pretrained ResNet-50 evaluated with Tiny-ImageNet images, PSI identifies neurons whose activation sets split into coherent, nameable prototypes, and reveals strong depth trends: later layers exhibit substantially higher PSI than earlier layers. We validate our approach with robustness checks (varying hyperparameters, random seeds, and cross-encoder text heads), breadth analyses (comparing class-only vs. open-vocabulary concepts), and causal patch-swap interventions. In particular, aligned patch replacements increase target-neuron activation significantly more than non-aligned, random, shuffled-position, or ablate-elsewhere controls. PSI thus offers a principled and practical lever for discovering, quantifying, and studying polysemantic units in neural networks.
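The abstract says that PSI multiplies three independently null-calibrated components (S, Q, D) but does not say how each one is calibrated. The following is a minimal sketch of that product structure under one assumed calibration: each raw statistic is mapped to the fraction of null draws it exceeds. The function names, the calibration rule, and all numbers are hypothetical illustrations, not the authors' method.

```python
import numpy as np


def null_calibrate(observed, null_samples):
    """Assumed calibration: fraction of null draws the observed value strictly exceeds.

    Returns a value in [0, 1]. The paper's actual calibration may differ.
    """
    null = np.asarray(null_samples, dtype=float)
    return float(np.mean(observed > null))


def polysemanticity_index(s, q, d):
    """Product of the three calibrated components, as the abstract describes."""
    return s * q * d


# Made-up raw statistics and null draws for a single hypothetical neuron.
s_cal = null_calibrate(0.5, [0.1, 0.2, 0.3, 0.9])  # geometric cluster quality (S)
q_cal = null_calibrate(0.8, [0.2, 0.4, 0.6, 0.7])  # alignment to labeled categories (Q)
d_cal = null_calibrate(0.3, [0.1, 0.2, 0.4, 0.5])  # open-vocabulary distinctness (D)

psi = polysemanticity_index(s_cal, q_cal, d_cal)
print(s_cal, q_cal, d_cal, psi)
```

Because the index is a product, a component near zero pulls PSI toward zero regardless of the other two, so under this sketch a neuron scores high only when all three signals clear their null baselines.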

Related papers