Fugu-MT Paper Translation (Abstract): Patch-Effect Graph Kernels for LLM Interpretability

Paper overview: Patch-Effect Graph Kernels for LLM Interpretability

arxiv url: http://arxiv.org/abs/2605.06480v1
Date: Thu, 07 May 2026 16:03:47 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.973747
Title: Patch-Effect Graph Kernels for LLM Interpretability
Authors: Ruben Fernandez-Boullon, David N. Olivieri
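Abstract summary: Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. This paper proposes a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components.
Reference score (site-computed attention score): 0.0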
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods (direct influence via causal mediation, partial correlation, and co-influence) and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot features provide higher classification accuracy than global graph-shape descriptors. A screened paired-patching validation suggests that CI- and PC-selected candidate edges correspond to stronger activation-influence effects than random or low-rank candidates. Crucially, by evaluating these representations against rigorous prompt-only and raw patch-effect controls, we make the evidential scope of the benchmark explicit: graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim should address. Ultimately, our framework provides a compression and evaluation pipeline for comparing patching-derived structures under controlled baselines, separating robust slice-discriminative evidence from stronger task-general causal-circuit claims.
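Illustrative sketch (not from the paper): the abstract contrasts localized edge-slot features with global graph-shape descriptors computed on patch-effect graphs. The Python snippet below shows one way such a contrast could be set up for a partial-correlation graph. The data layout (a prompts-by-components matrix of patch effects), the ridge term, the 0.2 threshold, and all function names are assumptions made for illustration; the authors' actual construction and kernels may differ.

```python
import numpy as np

def partial_correlation(effects, ridge=1e-3):
    """Partial correlations between components.

    effects: array of shape (n_prompts, n_components); this layout is an
    assumption, not the paper's stated format.
    """
    cov = np.cov(effects, rowvar=False)
    cov = cov + ridge * np.eye(cov.shape[0])  # small ridge keeps the inverse stable
    prec = np.linalg.inv(cov)                 # precision matrix
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)            # standard precision-to-partial-correlation map
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

def edge_slot_features(pcorr, threshold=0.2):
    """One binary feature per component pair (a fixed 'slot' per possible edge)."""
    iu = np.triu_indices_from(pcorr, k=1)
    return (np.abs(pcorr[iu]) >= threshold).astype(float)

def global_shape_features(pcorr, threshold=0.2):
    """Coarse descriptors that ignore which components an edge connects."""
    adj = np.abs(pcorr) >= threshold
    np.fill_diagonal(adj, False)
    n = adj.shape[0]
    degrees = adj.sum(axis=1)
    density = adj.sum() / (n * (n - 1))
    return np.array([density, degrees.mean(), degrees.max()])

def linear_kernel(a, b):
    """Simplest possible kernel on feature vectors (a stand-in for a graph kernel)."""
    return float(np.dot(a, b))

# Synthetic stand-in data: 200 prompts, 6 components, with component 1
# made to depend on component 0 so that one edge should appear.
rng = np.random.default_rng(0)
effects = rng.normal(size=(200, 6))
effects[:, 1] += 0.8 * effects[:, 0]

pcorr = partial_correlation(effects)
slots = edge_slot_features(pcorr)
shape = global_shape_features(pcorr)
self_similarity = linear_kernel(slots, slots)
```

Edge-slot features keep the identity of each component pair, so two graphs with the same density but different edges get different vectors. The global descriptors collapse that distinction, which is one plausible reading of why the abstract reports them as less discriminative.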