Fugu-MT Paper Translation (Summary): From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

Paper summary: From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

arxiv url: http://arxiv.org/abs/2604.18309v1
Date: Mon, 20 Apr 2026 14:16:39 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.926381
Title: From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge
Authors: Julius Porbeck, Christian Medeiros Adriano, Holger Giese
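Abstract summary: Misleading explanations are harmful for downstream tasks (e.g., bug triage, bug fixing). This study investigates how various context configurations affect explanation quality.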
Reference score (independently computed attention score): 0.2230291569252836
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM)-based debugging systems can generate failure explanations, but these explanations may be incomplete or incorrect. Misleading explanations are harmful for downstream tasks (e.g., bug triage, bug fixing). We investigate how explanation quality is affected by various LLM context configurations. Existing work predominantly treats LLM-generated failure explanations as an ad hoc by-product of debugging or repair workflows, using generic prompting over undifferentiated artifacts such as code, tests, and error messages rather than targeting explanations as a first-class output with dedicated quality assessment. Consequently, existing approaches provide limited support for assessing whether these explanations capture the underlying fault-error-failure mechanism and for actionable next steps, and most techniques instead prioritize task success (e.g., patch correctness or review quality) over the explicit causal explanation quality. We systematically vary the debugging information to study how distinct context compositions affect the quality of LLM-generated failure explanations. Across 93 context configurations on real bugs and three economically viable models (gpt-5-mini, DeepSeek-V3.2, and Grok-4.1-fast), we evaluate explanations with six criteria and validate the LLM-as-a-judge scores against human ratings in a user study. Our results indicate that explanation quality is causally affected by context composition. Evidence-rich, failure-specific artifacts improve causal and action-oriented quality, whereas overly large contexts tend to yield vague explanations. Higher explanation-score quartiles are associated with higher downstream repair pass rates and, for some models, with fixes that are closer to the reference minimal fixes. In contrast, low-score quartiles can even underperform the no-explanation baseline. Reproduction package is publicly available.
