ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development
- URL: http://arxiv.org/abs/2602.01655v2
- Date: Mon, 09 Feb 2026 15:17:29 GMT
- Title: ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development
- Authors: Pengrui Lu, Shiqi Zhang, Yunzhong Hou, Lyumanshan Ye, Chaoyi Huang, Zixi Chen, Ji Zeng, Hantao Jiang, Pengfei Liu, Yiwei Wang, Ming-Hsuan Yang
- Abstract summary: ProjDevBench is an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality but struggle with complex system design, time complexity optimization, and resource management.
- Score: 49.63491095660809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management. Our benchmark is available at https://github.com/zsworld6/projdevbench.
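The abstract describes a two-stage evaluation: Online Judge (OJ) testing for functional correctness combined with LLM-assisted code review, aggregated into an acceptance rate. As a rough illustration only, the following sketch shows how such a pipeline could be wired together; every name in it (TaskResult, run_oj_tests, the reviewer callable) is a hypothetical placeholder, not code from the ProjDevBench repository.

```python
# Hypothetical sketch of an end-to-end evaluation loop in the spirit of
# ProjDevBench: run OJ-style tests on each generated repository, add an
# LLM-assisted review verdict, and report the overall acceptance rate.
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class TaskResult:
    task_id: str
    tests_passed: bool    # Online Judge verdict on the task's test suite
    review_passed: bool   # LLM-assisted code review verdict

    @property
    def accepted(self) -> bool:
        # Count a submission as accepted only if it is both functionally
        # correct and approved by the review stage.
        return self.tests_passed and self.review_passed


def run_oj_tests(repo: Path, test_cmd: str = "pytest -q") -> bool:
    """Run the task's test suite inside the generated repository."""
    proc = subprocess.run(test_cmd.split(), cwd=repo, capture_output=True)
    return proc.returncode == 0


def evaluate(repos: dict[str, Path], reviewer: Callable[[Path], bool]) -> float:
    """Evaluate every generated repository and return the acceptance rate."""
    results = [
        TaskResult(task_id, run_oj_tests(path), reviewer(path))
        for task_id, path in repos.items()
    ]
    accepted = sum(r.accepted for r in results)
    return accepted / len(results) if results else 0.0
```

Under a harness of this shape, the reported 27.38% would correspond to the fraction of task attempts for which both checks pass, though the benchmark's actual aggregation may differ.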
Related papers
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
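As a loose illustration of what exposing such tools to an agent can look like, the sketch below registers a couple of file-operation and search tools behind a single dispatch interface. The registry pattern and every tool name here are assumptions made for illustration, not LoCoBench-Agent's actual API.

```python
# Illustrative sketch of a tool registry an interactive coding-agent harness
# might expose (file operations, search, code analysis). Names are
# hypothetical and do not reproduce LoCoBench-Agent's real tool set.
from pathlib import Path
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a callable under a name the agent can invoke."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("read_file")
def read_file(path: str) -> str:
    return Path(path).read_text()


@tool("search")
def search(root: str, needle: str) -> str:
    hits = [str(p) for p in Path(root).rglob("*.py")
            if needle in p.read_text(errors="ignore")]
    return "\n".join(hits) or "no matches"


def dispatch(call: dict) -> str:
    """Execute one agent tool call of the form {'name': ..., 'args': {...}}."""
    return TOOLS[call["name"]](**call.get("args", {}))
```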
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
- Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling [7.753074942497876]
We introduce CodeProjectEval, a project-level code generation dataset built from 18 real-world repositories, with an average of 12.7 files and 2,388.6 lines of code per task. We propose ProjectGen, a multi-agent framework that decomposes projects into architecture design, skeleton generation, and code filling stages. Experiments show that ProjectGen achieves state-of-the-art performance, passing 52/124 test cases on the small-scale project-level code generation dataset DevBench.
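The three-stage decomposition described above can be pictured as a sequential multi-agent pipeline. The sketch below is schematic only; the stage functions, prompts, and ProjectState fields are assumed names for illustration, not ProjectGen's implementation.

```python
# Schematic of a three-stage project-generation pipeline in the spirit of
# ProjectGen: architecture design -> skeleton generation -> code filling.
# All function and field names are hypothetical placeholders.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProjectState:
    requirements: str
    architecture: str = ""                                   # module/component plan
    skeleton: dict[str, str] = field(default_factory=dict)   # path -> stub
    files: dict[str, str] = field(default_factory=dict)      # path -> code


def design_architecture(state: ProjectState, llm: Callable[[str], str]) -> ProjectState:
    state.architecture = llm(f"Design an architecture for:\n{state.requirements}")
    return state


def generate_skeleton(state: ProjectState, llm: Callable[[str], str]) -> ProjectState:
    plan = llm(f"List the files (one per line) implied by:\n{state.architecture}")
    state.skeleton = {line.strip(): "" for line in plan.splitlines() if line.strip()}
    return state


def fill_code(state: ProjectState, llm: Callable[[str], str]) -> ProjectState:
    for path in state.skeleton:
        state.files[path] = llm(f"Implement {path} given:\n{state.architecture}")
    return state


def run_pipeline(requirements: str, llm: Callable[[str], str]) -> ProjectState:
    state = ProjectState(requirements)
    for stage in (design_architecture, generate_skeleton, fill_code):
        state = stage(state, llm)   # each stage could be handled by its own agent
    return state
```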
arXiv Detail & Related papers (2025-11-05T12:12:35Z)
- A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of Large Language Model-powered software engineering. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z)
- FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding [11.846768103642583]
FeatBench is a novel benchmark for vibe coding that focuses on feature implementation. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%.
arXiv Detail & Related papers (2025-09-26T11:47:50Z)
- LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering [85.58151741052616]
LoCoBench is a benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages. LoCoBench introduces 8 task categories that capture essential long-context understanding capabilities.
arXiv Detail & Related papers (2025-09-11T16:55:04Z)
- GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging [41.754784344572286]
We release GitTaskBench, a benchmark for evaluating code agents in real-world scenarios. Each task pairs a relevant repository with an automated, human-curated evaluation harness. We also propose the alpha-value metric to quantify the economic benefit of agent performance.
arXiv Detail & Related papers (2025-08-26T12:48:05Z)
- DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation [31.237236649603123]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering. DesignBench is a benchmark for assessing MLLMs' capabilities in automated front-end engineering.
arXiv Detail & Related papers (2025-06-06T17:21:21Z)
- FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation [26.14778133391999]
FEA-Bench is a benchmark designed to assess the ability of large language models to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development.
arXiv Detail & Related papers (2025-03-09T16:11:57Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
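One plausible way to implement the dynamic calling-chain extraction step is to profile function calls while an existing unit test runs. The sketch below does that with Python's built-in profiling hook; it is an assumption about how such a step could work, not Codev-Agent's actual code.

```python
# Illustrative sketch: record the dynamic calling chain exercised by a unit
# test by capturing call events while it runs. One plausible mechanism for a
# calling-chain extraction step; not Codev-Agent's implementation.
import sys
from typing import Callable


def extract_call_chain(test_fn: Callable[[], None]) -> list[str]:
    chain: list[str] = []

    def profiler(frame, event, arg):
        if event == "call":                       # Python-level function call
            code = frame.f_code
            chain.append(f"{code.co_filename}:{code.co_name}")

    sys.setprofile(profiler)
    try:
        test_fn()                                 # run the existing unit test
    finally:
        sys.setprofile(None)                      # always remove the hook
    return chain
```

The recovered chain could then guide which functions to hide or regenerate as new completion targets, helping avoid leakage from existing test data.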
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)