Fugu-MT Paper Translation (Abstract): Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Paper overview: Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

arxiv url: http://arxiv.org/abs/2604.17761v1
Date: Mon, 20 Apr 2026 03:24:11 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.676782
Title: Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
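Title (reference translation): Interpretability Analysis of LLM Failures on Realistic Benchmarks
Authors: Rongyuan Tan, Jue Zhang, Zhuozhao Li, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Abstract summary: Interpretability tools are increasingly used to analyze failures of large language models (LLMs). The paper studies LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. The results show that this token-level contrastive attribution yields informative signals in some failure cases but is not universally applicable.
Reference score (independently computed attention metric): 42.92210265283373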
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: https://aka.ms/Debug-XAI.
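Abstract (reference translation): Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), but prior work has focused mainly on short prompts or toy settings, so their behavior on commonly used benchmarks remains underexplored. To address this gap, the paper studies contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. It formulates failure analysis as contrastive attribution, which attributes the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and it introduces an efficient extension that enables the construction of cross-layer attribution graphs for long-context inputs. Using this framework, the authors conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. The results show that this token-level contrastive attribution yields informative signals in some failure cases but is not universally applicable, highlighting both its usefulness and its limitations for realistic LLM failure analysis. The code is available at https://aka.ms/Debug-XAI.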
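
Illustrative sketch (not from the paper): the abstract frames failure analysis as attributing the logit difference between an incorrect output token and a correct alternative to the inputs. The toy Python below shows that idea for a single bias-free linear readout, where the difference splits exactly into one term per input feature. It is not the LRP procedure the authors use on full LLMs; the vocabulary, weights, feature values, and function names are all hypothetical.

```python
# Toy illustration only: one bias-free linear readout, not the paper's LRP pipeline.
# All tokens, weights, and feature values below are made up for the example.


def logit(weights, features, token):
    """Logit of `token` under a linear readout: dot(weights[token], features)."""
    return sum(w * x for w, x in zip(weights[token], features))


def contrastive_attribution(weights, features, wrong_token, right_token):
    """Split logit[wrong] - logit[right] into one contribution per input feature."""
    w_wrong = weights[wrong_token]
    w_right = weights[right_token]
    return [(a - b) * x for a, b, x in zip(w_wrong, w_right, features)]


# Hypothetical 3-token vocabulary with 4 input features.
WEIGHTS = {
    "Paris": [2.0, 0.0, 1.0, -1.0],
    "Rome": [1.0, 0.0, 3.0, 0.0],
    "Oslo": [0.0, 1.0, 0.0, 0.0],
}
FEATURES = [1.0, 2.0, 1.0, 1.0]

# Suppose the model wrongly prefers "Rome" over the correct "Paris".
scores = contrastive_attribution(WEIGHTS, FEATURES, wrong_token="Rome", right_token="Paris")
logit_gap = logit(WEIGHTS, FEATURES, "Rome") - logit(WEIGHTS, FEATURES, "Paris")

# Positive scores push toward the wrong token, negative scores toward the right one.
top_feature = max(range(len(scores)), key=lambda i: scores[i])
print("per-feature scores:", scores)
print("logit gap:", logit_gap, "| most responsible feature index:", top_feature)
```

In this linear case the per-feature scores sum to the logit gap, so the feature with the largest positive score is the one that contributes most to the wrong token winning. Deep models need a propagation rule such as LRP to obtain a comparable decomposition across layers.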

Related papers