Fugu-MT Paper Translation (Abstract): Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

Paper summary: Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

arxiv url: http://arxiv.org/abs/2510.03282v1
Date: Sun, 28 Sep 2025 18:34:43 GMT
Status: Translation complete
Last updated in system: 2025-10-07 19:16:49.462863
Title: Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework
Title (reference translation): Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework
Authors: Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar, Jayvart Sharma, Ryan Lagasse,
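Abstract summary: This work proposes a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph. It shows that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness.
Reference score (independently computed attention score): 4.336808542533343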
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Interpreting language models often involves circuit analysis, which aims to identify sparse subnetworks, or circuits, that accomplish specific tasks. Existing circuit discovery algorithms face a fundamental trade-off: attribution patching is fast but unfaithful to the full model, while edge pruning is faithful but computationally expensive. This research proposes a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph, then applies edge pruning to extract a faithful circuit from it. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness. Furthermore, we present a case study on the Indirect Object Identification task, showing that our method preserves cooperative circuit components (e.g., S-inhibition heads) that attribution patching methods prune at high sparsity. Our results show that HAP could be an effective approach for improving the scalability of mechanistic interpretability research to larger models. Our code is available at https://anonymous.4open.science/r/HAP-circuit-discovery.
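Abstract (reference translation): Interpreting language models often involves circuit analysis, which aims to identify sparse subnetworks (circuits) that carry out specific tasks. Existing circuit discovery algorithms face a fundamental trade-off: attribution patching is fast but unfaithful to the full model, while edge pruning is faithful but computationally expensive. This work proposes a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph and then uses edge pruning to extract a faithful circuit. The results show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness. In addition, a case study on the Indirect Object Identification task shows that the method preserves cooperative circuit components (e.g., S-inhibition heads) that attribution patching methods prune at high sparsity. These results suggest that HAP may be an effective approach for improving the scalability of mechanistic interpretability research. The code is available at https://anonymous.4open.science/r/HAP-circuit-discovery.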
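
The two-stage idea described in the abstract (a cheap attribution pass to shortlist edges, followed by a more careful pruning pass on the shortlist) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the edge scores, the `faithfulness` function, and the threshold are all hypothetical, and the greedy removal loop is a simple stand-in for edge pruning, which the abstract does not specify in detail.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]  # (source node, destination node); hypothetical representation


def top_k_by_attribution(scores: Dict[Edge, float], k: int) -> FrozenSet[Edge]:
    """Stage 1: keep the k edges with the largest absolute attribution score."""
    ranked = sorted(scores, key=lambda e: abs(scores[e]), reverse=True)
    return frozenset(ranked[:k])


def greedy_prune(
    edges: Iterable[Edge],
    faithfulness: Callable[[FrozenSet[Edge]], float],
    min_faithfulness: float,
) -> FrozenSet[Edge]:
    """Stage 2 (toy stand-in for edge pruning): try removing each edge in turn
    and keep the removal only if faithfulness stays at or above the threshold."""
    kept = set(edges)
    for edge in sorted(kept):  # fixed order so the result is deterministic
        trial = frozenset(kept - {edge})
        if faithfulness(trial) >= min_faithfulness:
            kept = set(trial)
    return frozenset(kept)


def hap_sketch(
    scores: Dict[Edge, float],
    k: int,
    faithfulness: Callable[[FrozenSet[Edge]], float],
    min_faithfulness: float,
) -> FrozenSet[Edge]:
    """Shortlist edges by attribution, then prune the shortlist."""
    candidate = top_k_by_attribution(scores, k)
    return greedy_prune(candidate, faithfulness, min_faithfulness)


# Made-up attribution scores for a five-edge graph (illustrative only).
scores = {
    ("a", "b"): 0.9,
    ("b", "out"): 0.8,
    ("x", "out"): 0.5,
    ("s", "b"): 0.2,   # low score, but needed by the toy task
    ("y", "out"): 0.01,
}
required = frozenset({("a", "b"), ("b", "out"), ("s", "b")})


def faithfulness(circuit: FrozenSet[Edge]) -> float:
    """Toy metric: fraction of the required edges present in the circuit."""
    return len(required & circuit) / len(required)


circuit = hap_sketch(scores, k=4, faithfulness=faithfulness, min_faithfulness=1.0)
aggressive = top_k_by_attribution(scores, k=2)
```

In this toy setup, keeping only the top two edges by attribution drops `("s", "b")`, while the generous shortlist followed by pruning retains it and discards `("x", "out")`. That loosely mirrors the abstract's point about cooperative components, under the stated assumptions only.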

Related papers