Fugu-MT Paper Translation (Abstract): EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems

Paper overview: EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems

arxiv url: http://arxiv.org/abs/2602.10171v1
Date: Tue, 10 Feb 2026 14:04:22 GMT
Status: Translation complete
System update date: 2026-02-12 21:44:01.215882
Title: EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems
Authors: Wentao Zhang, Jianfeng Wang, Liheng Liang, Yilei Zhao, HaiBin Wen, Zhe Zhao,
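Abstract summary: LLM-driven coding systems have evolved from one-shot code generation into complex systems capable of iterative improvement at inference time. EvoCodeBench is a benchmark for evaluating self-evolving LLM-driven coding systems across programming languages. The results show that self-evolving systems become more efficient over time, and that human-relative and multi-language analyses provide insights unavailable through accuracy alone.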
Reference score (attention score computed by the site): 24.49186459186861
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) continue to advance in programming tasks, LLM-driven coding systems have evolved from one-shot code generation into complex systems capable of iterative improvement during inference. However, existing code benchmarks primarily emphasize static correctness and implicitly assume fixed model capability during inference. As a result, they do not capture inference-time self-evolution, such as whether accuracy and efficiency improve as an agent iteratively refines its solutions. They also provide limited accounting of resource costs and rarely calibrate model performance against that of human programmers. Moreover, many benchmarks are dominated by high-resource languages, leaving cross-language robustness and long-tail language stability underexplored. Therefore, we present EvoCodeBench, a benchmark for evaluating self-evolving LLM-driven coding systems across programming languages with direct comparison to human performance. EvoCodeBench tracks performance dynamics, measuring solution correctness alongside efficiency metrics such as solving time, memory consumption, and improvements in algorithmic design over repeated problem-solving attempts. To ground evaluation in a human-centered reference frame, we directly compare model performance with that of human programmers on the same tasks, enabling relative performance assessment within the human ability distribution. Furthermore, EvoCodeBench supports multiple programming languages, enabling systematic cross-language and long-tail stability analyses under a unified protocol. Our results demonstrate that self-evolving systems exhibit measurable gains in efficiency over time, and that human-relative and multi-language analyses provide insights unavailable through accuracy alone. EvoCodeBench establishes a foundation for evaluating coding intelligence in evolving LLM-driven systems.
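Illustration (not from the paper): the abstract describes placing a model's performance within the human ability distribution and tracking it over repeated attempts. One plausible way to do this, sketched below, is to convert each attempt's runtime into the percentage of human submissions it beats. The function names, the "lower runtime is better" convention, and all numbers are hypothetical; the paper's actual metric definitions may differ.

```python
from bisect import bisect_right


def human_percentile(model_value, human_values):
    """Percentage of human results strictly worse (larger) than the model's.

    Assumes a lower-is-better metric such as runtime in milliseconds.
    """
    ordered = sorted(human_values)
    # bisect_right counts human values <= model_value; the rest are slower.
    slower = len(ordered) - bisect_right(ordered, model_value)
    return 100.0 * slower / len(ordered)


def percentile_trajectory(attempt_values, human_values):
    """Human-relative percentile for each successive attempt."""
    return [human_percentile(v, human_values) for v in attempt_values]


# Hypothetical data: ten human runtimes and three successive model attempts.
human_runtimes_ms = [120, 95, 300, 150, 80, 210, 175, 60, 140, 110]
model_runtimes_ms = [250, 160, 100]

trajectory = percentile_trajectory(model_runtimes_ms, human_runtimes_ms)
print(trajectory)  # [10.0, 30.0, 70.0]
```

A rising trajectory under this convention would indicate that later attempts beat a larger share of human submissions, which is the kind of signal a plain pass/fail accuracy score cannot show.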

Related papers