Fugu-MT Paper Translation (Abstract): A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task

Paper overview: A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task

arxiv url: http://arxiv.org/abs/2603.26337v1
Date: Fri, 27 Mar 2026 11:58:47 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.482767
Title: A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task
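Title (reference translation): A Benchmark for Evaluating Repository-Level Code Agents via Intermediate Reasoning on Feature Addition Tasks
Authors: Shuhan Liu, Zhiyi Zhao, Xing Hu, Kui Liu, Xiaohu Yang, Xin Xia
Abstract summary: RACE-bench is a reasoning-augmented benchmark for evaluating code agents on feature addition tasks. It contains 528 real-world feature addition instances from 12 open-source repositories. Three repository-level code agents are evaluated on RACE-bench.
Reference score (attention score computed independently by this site): 11.218318079376365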
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Repository-level code agents have shown strong promise in real-world feature addition tasks, making reliable evaluation of their capabilities increasingly important. However, existing benchmarks primarily evaluate these agents as black boxes based on final test correctness, providing limited insight into how they reason and where failures arise. To address this limitation, we introduce RACE-bench, a reasoning-augmented benchmark for evaluating code agents on repository-level feature addition tasks. RACE-bench contains 528 real-world feature addition instances from 12 open-source repositories. Each instance is paired with executable patch verification and structured intermediate reasoning ground truth covering issue understanding, file localization, implementation tasks, and step decomposition. Based on this design, we introduce a dual-track evaluation framework that jointly measures patch correctness and intermediate reasoning quality. We evaluate three representative repository-level code agents on RACE-bench. On the full benchmark, Resolved Rates range from 29% to 70% across different agents. Our reasoning-level analysis further shows that while current agents perform well at understanding high-level intent, their performance degrades substantially when translating intent into concrete implementation steps. We also find that apply-success but test-fail cases exhibit lower reasoning recall (35.7% decrease) and higher over-prediction (94.1% increase) compared to successful cases. These findings highlight the importance of evaluating repository-level code agents beyond final patch correctness by examining the quality of their reasoning processes.
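Abstract (reference translation): Repository-level code agents show strong promise on real-world feature addition tasks, so evaluating their capabilities reliably is increasingly important. Existing benchmarks, however, mostly treat these agents as black boxes and score them on final test correctness, which gives limited insight into how they reason and where failures arise. To address this limitation, the authors introduce RACE-bench, a reasoning-augmented benchmark for evaluating code agents on repository-level feature addition tasks. RACE-bench contains 528 real-world feature addition instances from 12 open-source repositories. Each instance is paired with executable patch verification and with structured intermediate reasoning ground truth covering issue understanding, file localization, implementation tasks, and step decomposition. Building on this design, the authors introduce a dual-track evaluation framework that jointly measures patch correctness and intermediate reasoning quality. They evaluate three representative repository-level code agents on RACE-bench. On the full benchmark, Resolved Rates range from 29% to 70% across agents. The reasoning-level analysis shows that current agents understand high-level intent well, but their performance degrades substantially when that intent has to be translated into concrete implementation steps. Cases where the patch applies but the tests fail show lower reasoning recall (a 35.7% decrease) and higher over-prediction (a 94.1% increase) than successful cases. These findings highlight the importance of evaluating repository-level code agents beyond final patch correctness by also examining the quality of their reasoning processes.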
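
The dual-track idea in the abstract can be illustrated with a small sketch. The abstract does not define the metrics, so everything below is an assumption made for illustration: the `InstanceResult` record, the set-based definitions of recall and over-prediction on file localization, and the toy data are all hypothetical and are not taken from RACE-bench itself.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Hypothetical per-instance record; field names are illustrative only."""
    resolved: bool        # patch track: did the patch pass the verification tests?
    predicted_files: set  # reasoning track: files the agent said it would edit
    gold_files: set       # reasoning track: files in the ground truth


def resolved_rate(results):
    """Fraction of instances whose patch passed verification."""
    return sum(r.resolved for r in results) / len(results)


def recall(predicted, gold):
    """Assumed definition: share of ground-truth items the agent recovered."""
    return len(predicted & gold) / len(gold) if gold else 1.0


def over_prediction(predicted, gold):
    """Assumed definition: share of predicted items absent from the ground truth."""
    return len(predicted - gold) / len(predicted) if predicted else 0.0


# Toy data: one resolved instance and one that fails verification.
results = [
    InstanceResult(True, {"a.py", "b.py"}, {"a.py", "b.py"}),
    InstanceResult(False, {"a.py", "c.py", "d.py"}, {"a.py", "b.py"}),
]

for r in results:
    print(r.resolved,
          recall(r.predicted_files, r.gold_files),
          over_prediction(r.predicted_files, r.gold_files))
print("resolved rate:", resolved_rate(results))
```

Reporting the two tracks side by side is what lets a failed instance be traced to a reasoning gap, here a missed file plus two spurious ones, rather than only to a failing test.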

Related papers