Fugu-MT Paper Translation (Abstract): Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Paper overview: Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

arxiv url: http://arxiv.org/abs/2604.17338v2
Date: Fri, 24 Apr 2026 00:21:13 GMT
Status: Translation complete
Last updated in system: 2026-04-27 15:36:26.19812
Title: Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
Authors: Wang Bill Zhu, Miaosen Chai, Shangshang Wang, Yejia Liu, Song Bian, Honghua Dong, Willie Neiswanger, Robin Jia
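Abstract summary: The framework automatically converts any coding dataset into a precision-aware debugging benchmark. It defines two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved. In experiments, frontier models such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking achieve unit-test pass rates above 76% but precision below 45%.
Reference score (attention score computed independently by the site): 31.082688278576356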
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
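
The abstract names the two metrics but does not give their formulas, so the following is only a minimal sketch under an assumed line-based definition: precision is taken as the share of edited lines that fall on a known bug location, and recall as the share of injected bugs that were resolved. The `DebugResult` record, its field names, and both functions are hypothetical and are not taken from the PDB code base.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DebugResult:
    """Hypothetical record of one debugging attempt (all names are illustrative)."""
    edited_lines: frozenset[int]            # lines the model changed in the buggy program
    bug_lines: dict[str, frozenset[int]]    # injected bug id -> lines that bug touches
    resolved_bugs: frozenset[str]           # bug ids judged fixed, e.g. by per-bug tests


def edit_level_precision(result: DebugResult) -> float:
    """Assumed definition: fraction of edited lines that lie on a bug location."""
    if not result.edited_lines:
        return 0.0  # assumption: an attempt with no edits gets zero precision
    necessary = frozenset().union(*result.bug_lines.values())
    return len(result.edited_lines & necessary) / len(result.edited_lines)


def bug_level_recall(result: DebugResult) -> float:
    """Assumed definition: fraction of injected bugs that were resolved."""
    if not result.bug_lines:
        return 1.0  # assumption: nothing to fix counts as full recall
    fixed = result.resolved_bugs & result.bug_lines.keys()
    return len(fixed) / len(result.bug_lines)


# Two single-line bugs, injected at lines 3 and 7 of a 10-line program.
bugs = {"bug_a": frozenset({3}), "bug_b": frozenset({7})}

# A model that rewrites all ten lines fixes both bugs but over-edits.
regenerated = DebugResult(frozenset(range(1, 11)), bugs, frozenset({"bug_a", "bug_b"}))

# A model that touches only the two faulty lines.
targeted = DebugResult(frozenset({3, 7}), bugs, frozenset({"bug_a", "bug_b"}))

print(edit_level_precision(regenerated), bug_level_recall(regenerated))  # 0.2 1.0
print(edit_level_precision(targeted), bug_level_recall(targeted))        # 1.0 1.0
```

Under these assumptions both attempts would pass the unit tests and reach full recall, yet only the targeted one scores high precision, which is the gap between passing tests and precise debugging that the abstract describes.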
