Fugu-MT Paper Translation (Abstract): GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

Paper summary: GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

arxiv url: http://arxiv.org/abs/2508.18993v2
Date: Sun, 14 Sep 2025 17:21:03 GMT
Status: Translation complete
System update date: 2025-09-16 15:23:16.37258
Title: GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging
Title (reference translation): GitTaskBench: A Benchmark for Solving Real-World Tasks by Leveraging Code Repositories
Authors: Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, Ziyang He, Wang You, Zhenheng Tang, Yuntao Du, Bill Sun, Hongzhang Liu, Sen Hu, Ronghao Chen, Bo Li, Xin Li, Chen Hu, Binxing Jiao, Daxin Jiang, Pin Lyu
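Abstract summary: We release GitTaskBench, a benchmark that evaluates code agents in real-world scenarios. Each task pairs a relevant repository with an automated, human-curated evaluation harness. We also propose the alpha-value metric to quantify the economic benefit of agent performance.
Reference score (independently computed attention score): 41.754784344572286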
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. To bridge this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability via 54 realistic tasks across 7 modalities and 7 domains. Each task pairs a relevant repository with an automated, human-curated evaluation harness specifying practical success criteria. Beyond measuring execution and task success, we also propose the alpha-value metric to quantify the economic benefit of agent performance, which integrates task success rates, token cost, and average developer salaries. Experiments across three state-of-the-art agent frameworks with multiple advanced LLMs show that leveraging code repositories for complex task solving remains challenging: even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks (recent progress has pushed the frontier further, with RepoMaster+Claude 3.5 achieving a new record of 62.96%). Error analysis attributes over half of failures to seemingly mundane yet critical steps like environment setup and dependency resolution, highlighting the need for more robust workflow management and increased timeout preparedness. By releasing GitTaskBench, we aim to drive progress and attention toward repository-aware code reasoning, execution, and deployment -- moving agents closer to solving complex, end-to-end real-world tasks. The benchmark and code are open-sourced at https://github.com/QuantaAlpha/GitTaskBench.
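Abstract (reference translation): Beyond coding from scratch, leveraging large-scale code repositories (e.g., GitHub) for practical tasks is essential in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. To fill this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability through 54 realistic tasks spanning 7 modalities and 7 domains. Each task pairs a relevant repository with an automated, human-curated evaluation harness that specifies practical success criteria. We also propose the alpha-value metric, which combines task success rate, token cost, and average developer salary to quantify the economic benefit of agent performance. Experiments with three state-of-the-art agent frameworks and multiple advanced LLMs show that leveraging code repositories for complex task solving remains difficult: even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks (more recently, RepoMaster+Claude 3.5 set a new record of 62.96%). Error analysis attributes more than half of the failures to seemingly mundane yet critical steps such as environment setup and dependency resolution, underscoring the need for more robust workflow management and better timeout preparedness. By releasing GitTaskBench, we aim to draw progress and attention toward repository-aware code reasoning, execution, and deployment, bringing agents closer to solving complex, end-to-end real-world tasks. The benchmark and code are available at https://github.com/QuantaAlpha/GitTaskBench.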
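As a quick sanity check on the figures in the abstract, the two reported success rates line up with whole-task counts out of the 54 tasks. The sketch below is only an arithmetic illustration: the counts of 26 and 34 solved tasks are inferred from the percentages and are not stated in the abstract, and the helper names are hypothetical.

```python
TOTAL_TASKS = 54  # number of tasks stated in the abstract


def solved_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the whole number of tasks closest to a reported success rate."""
    return round(rate_percent / 100 * total)


def rate(solved: int, total: int = TOTAL_TASKS) -> float:
    """Return the success rate in percent, rounded to two decimals."""
    return round(100 * solved / total, 2)


# Reported rates from the abstract; the task counts are inferred, not quoted.
reported = [
    ("OpenHands+Claude 3.7", 48.15),
    ("RepoMaster+Claude 3.5", 62.96),
]

for label, pct in reported:
    n = solved_count(pct)
    print(f"{label}: {n}/{TOTAL_TASKS} tasks = {rate(n)}%")
```

Under this reading, the gap between the two systems corresponds to 8 additional tasks solved out of 54.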


Related papers