Fugu-MT Paper Translation (Summary): Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation

Paper summary: Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation

arxiv url: http://arxiv.org/abs/2510.24358v1
Date: Tue, 28 Oct 2025 12:26:45 GMT
Status: Translation complete
System update date: 2025-10-29 15:35:37.110482
Title: Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
Authors: Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu
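Abstract summary: PRDBench is a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. The authors use an Agent-as-a-Judge paradigm to evaluate agent outputs.
Reference score (attention score computed by Fugu-MT): 47.85891728056131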
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs) and widely adopted tools. However, existing benchmarks for code agent evaluation face two major limitations: high annotation cost and expertise requirements, and rigid evaluation metrics that rely primarily on unit tests. To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse and challenging project-level tasks. Based on this approach, we introduce PRDBench, a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. PRDBench features rich data sources, high task complexity, and flexible metrics. We further employ an Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of various test types beyond unit tests. Extensive experiments on PRDBench demonstrate its effectiveness in assessing the capabilities of both code agents and evaluation agents, providing a scalable and robust framework for annotation and evaluation.

Related papers