EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems
- URL: http://arxiv.org/abs/2602.10171v1
- Date: Tue, 10 Feb 2026 14:04:22 GMT
- Title: EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems
- Authors: Wentao Zhang, Jianfeng Wang, Liheng Liang, Yilei Zhao, HaiBin Wen, Zhe Zhao,
- Abstract summary: LLM-driven coding systems have evolved from one-shot code generation into complex systems capable of iterative improvement during inference. EvoCodeBench is a benchmark for evaluating self-evolving LLM-driven coding systems across programming languages with direct comparison to human performance. Our results demonstrate that self-evolving systems exhibit measurable gains in efficiency over time, and that human-relative and multi-language analyses provide insights unavailable through accuracy alone.
- Score: 24.49186459186861
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) continue to advance in programming tasks, LLM-driven coding systems have evolved from one-shot code generation into complex systems capable of iterative improvement during inference. However, existing code benchmarks primarily emphasize static correctness and implicitly assume fixed model capability during inference. As a result, they do not capture inference-time self-evolution, such as whether accuracy and efficiency improve as an agent iteratively refines its solutions. They also provide limited accounting of resource costs and rarely calibrate model performance against that of human programmers. Moreover, many benchmarks are dominated by high-resource languages, leaving cross-language robustness and long-tail language stability underexplored. Therefore, we present EvoCodeBench, a benchmark for evaluating self-evolving LLM-driven coding systems across programming languages with direct comparison to human performance. EvoCodeBench tracks performance dynamics, measuring solution correctness alongside efficiency metrics such as solving time, memory consumption, and improvements in algorithmic design over repeated problem-solving attempts. To ground evaluation in a human-centered reference frame, we directly compare model performance with that of human programmers on the same tasks, enabling relative performance assessment within the human ability distribution. Furthermore, EvoCodeBench supports multiple programming languages, enabling systematic cross-language and long-tail stability analyses under a unified protocol. Our results demonstrate that self-evolving systems exhibit measurable gains in efficiency over time, and that human-relative and multi-language analyses provide insights unavailable through accuracy alone. EvoCodeBench establishes a foundation for evaluating coding intelligence in evolving LLM-driven systems.
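The evaluation protocol described in the abstract (record correctness and efficiency metrics per attempt, track how they change across repeated attempts, and place a system's result within the human distribution) can be illustrated with a minimal Python sketch. This is not EvoCodeBench's actual harness: the class and function names (AttemptRecord, ProblemTrace, efficiency_gain, human_relative_percentile) and the exact metric definitions are illustrative assumptions.

```python
import time
import tracemalloc
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AttemptRecord:
    """Metrics for one problem-solving attempt by a coding system."""
    correct: bool
    solve_time_s: float
    peak_memory_kb: float


@dataclass
class ProblemTrace:
    """All attempts a system made on one problem, in submission order."""
    attempts: list[AttemptRecord] = field(default_factory=list)

    def record(self, solution_fn, checker_fn) -> None:
        """Run one candidate solution, measuring wall-clock time and peak memory."""
        tracemalloc.start()
        start = time.perf_counter()
        output = solution_fn()
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        self.attempts.append(AttemptRecord(
            correct=checker_fn(output),
            solve_time_s=elapsed,
            peak_memory_kb=peak / 1024,
        ))


def efficiency_gain(trace: ProblemTrace) -> float | None:
    """Speedup factor of the last correct attempt relative to the first correct one."""
    correct = [a for a in trace.attempts if a.correct]
    if len(correct) < 2:
        return None
    return correct[0].solve_time_s / correct[-1].solve_time_s


def human_relative_percentile(model_time: float, human_times: list[float]) -> float:
    """Fraction of human solving times that the model's time matches or beats."""
    return mean(1.0 if model_time <= t else 0.0 for t in human_times)
```

A real harness would execute untrusted solutions in a sandboxed subprocess and enforce time and memory limits there; the in-process measurement above only keeps the sketch self-contained.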
Related papers
- Code Fingerprints: Disentangled Attribution of LLM-Generated Code [7.515488307576106]
We study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code. We propose the Disentangled Code Attribution Network (DCAN), which separates Source-Agnostic semantic information from Source-Specific stylistic representations. We construct the first large-scale benchmark dataset comprising code generated by four widely used Large Language Models (LLMs) across four programming languages.
arXiv Detail & Related papers (2026-03-04T15:58:36Z) - CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants [2.2811622267552014]
Large Language Models (LLMs) are increasingly used for software development. Existing benchmarks for LLM-based coding assistance do not reflect the constraints of High Energy Physics and High Performance Computing software. This paper develops practical, repeatable benchmarks that quantify LLM performance on HEP/HPC-relevant tasks.
arXiv Detail & Related papers (2026-03-01T11:16:50Z) - From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code. We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder).
arXiv Detail & Related papers (2025-11-23T17:09:34Z) - Machine Learning Pipeline for Software Engineering: A Systematic Literature Review [0.0]
This systematic literature review examines state-of-the-art Machine Learning pipelines designed for software engineering (SE). Our findings show that robust preprocessing, such as SMOTE for data balancing, improves model reliability. Ensemble methods like Random Forest and Gradient Boosting dominate performance across tasks. New metrics like Best Arithmetic Mean (BAM) are emerging in niche applications.
arXiv Detail & Related papers (2025-07-31T15:37:30Z) - Program Semantic Inequivalence Game with Large Language Models [20.43560028315856]
Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game, SInQ. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources.
arXiv Detail & Related papers (2025-05-02T20:03:35Z) - Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention [71.12193680015622]
Large Language Models (LLMs) have shown remarkable capabilities in natural language processing.
LLMs exhibit significant performance gaps among different languages.
We propose Inference-Time Cross-Lingual Intervention (INCLINE) to overcome these limitations without incurring significant costs.
arXiv Detail & Related papers (2024-10-16T11:23:03Z) - The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers [44.28269395385471]
We study whether gains on existing benchmarks or more preferred LLM responses translate to programmer productivity when coding with LLMs.
We introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers.
Although static benchmarks do not incorporate humans in the loop, we find that improvements in benchmark performance lead to increased programmer productivity.
arXiv Detail & Related papers (2024-04-03T15:20:57Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Exploring Data-Efficient Adaptation of Large Language Models for Code Generation [64.5583894165813]
We propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation. Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with limited training data.
arXiv Detail & Related papers (2024-02-29T16:09:02Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with LITE enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)