Fugu-MT Paper Translation (Abstract): Causal Interpretation of Neural Network Computations with Contribution Decomposition

Paper summary: Causal Interpretation of Neural Network Computations with Contribution Decomposition

arxiv url: http://arxiv.org/abs/2603.06557v1
Date: Fri, 06 Mar 2026 18:46:06 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:46.403135
Title: Causal Interpretation of Neural Network Computations with Contribution Decomposition
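Title (reference translation): Causal Interpretation of Neural Network Computations via Contribution Decomposition
Authors: Joshua Brendan Melander, Zaki Alaoui, Shenghua Liu, Surya Ganguli, Stephen A. Baccus
Abstract summary: We take a direct approach to examining how hidden neurons drive network outputs. CODEC is a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions.
Reference score (independently computed attention measure): 13.992892699439023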
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.
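Abstract (reference translation): Understanding how neural networks transform inputs into outputs is important for interpreting and manipulating their behavior. Most existing methods analyze internal representations by identifying hidden-layer activation patterns that correlate with human-interpretable concepts. Here we directly examine how hidden neurons drive network outputs. CODEC (Contribution Decomposition) is a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that positive and negative effects on network outputs progressively decorrelate. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of the distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we show that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers.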

Related papers