Fugu-MT Paper Translation (Abstract): Large Language Models for Fault Localization: An Empirical Study

Paper overview: Large Language Models for Fault Localization: An Empirical Study

arxiv url: http://arxiv.org/abs/2510.20521v1
Date: Thu, 23 Oct 2025 13:04:22 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:17.930385
Title: Large Language Models for Fault Localization: An Empirical Study
Title (reference translation): Large Language Models for Fault Localization: An Empirical Study
Authors: YingJian Xiao, RongQun Hu, WeiWei Gong, HongWei Li, AnQuan Jie
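Abstract summary: This paper presents a systematic empirical study of large language models (LLMs) on the statement-level code fault localization task. It evaluates open-source models (Qwen2.5-coder-32b-instruct, DeepSeek-V3) and closed-source models (GPT-4.1 mini, Gemini-2.5-flash) to assess their fault localization capabilities.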
Reference score (independently computed attention measure): 3.2111987440830974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, particularly in automated program repair. However, the effectiveness of such repairs is highly dependent on the performance of upstream fault localization, for which comprehensive evaluations are currently lacking. This paper presents a systematic empirical study on LLMs in the statement-level code fault localization task. We evaluate representative open-source models (Qwen2.5-coder-32b-instruct, DeepSeek-V3) and closed-source models (GPT-4.1 mini, Gemini-2.5-flash) to assess their fault localization capabilities on the HumanEval-Java and Defects4J datasets. The study investigates the impact of different prompting strategies--including standard prompts, few-shot examples, and chain-of-reasoning--on model performance, with a focus on analysis across accuracy, time efficiency, and economic cost dimensions. Our experimental results show that incorporating bug report context significantly enhances model performance. Few-shot learning shows potential for improvement but exhibits noticeable diminishing marginal returns, while chain-of-thought reasoning's effectiveness is highly contingent on the model's inherent reasoning capabilities. This study not only highlights the performance characteristics and trade-offs of different models in fault localization tasks, but also offers valuable insights into the strengths of current LLMs and strategies for improving fault localization effectiveness.
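Abstract (reference translation): Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, particularly in automated program repair. However, the effectiveness of these repairs depends heavily on the performance of upstream fault localization, which currently lacks comprehensive evaluation. This paper presents a systematic empirical study of LLMs on the statement-level code fault localization task. It evaluates representative open-source models (Qwen2.5-coder-32b-instruct, DeepSeek-V3) and closed-source models (GPT-4.1 mini, Gemini-2.5-flash) and assesses their fault localization capabilities on the HumanEval-Java and Defects4J datasets. The study examines how different prompting strategies, including standard prompts, few-shot examples, and chain-of-reasoning, affect model performance, with the analysis focused on accuracy, time efficiency, and economic cost. The experimental results show that incorporating bug report context significantly improves model performance. Few-shot learning shows potential for improvement but exhibits clearly diminishing marginal returns, while the effectiveness of chain-of-thought reasoning depends strongly on the model's inherent reasoning capabilities. The study not only highlights the performance characteristics and trade-offs of different models in fault localization tasks, but also provides valuable insights into the strengths of current LLMs and strategies for improving fault localization effectiveness.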


Related papers