Related papers: Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

URL: http://arxiv.org/abs/2511.04064v1
Date: Thu, 06 Nov 2025 05:10:04 GMT
Title: Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development
Authors: Zhengran Zeng, Yixin Li, Rui Xie, Wei Ye, Shikun Zhang,
Abstract summary: Development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering.<n>This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents.
Score: 33.01897134024342
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges, including overly simplistic benchmarks and the difficulty of conducting fair comparisons between different agent architectures due to confounding implementation variables. To address these limitations, we first construct a challenging and dynamically curated E2EDevBench to simulate realistic development scenarios. Second, we propose a hybrid evaluation framework that combines test-case-based functional assessment with fine-grained, LLM-based requirement verification. Using this framework, we conduct a controlled empirical study on three representative agent architectures implemented upon a unified foundation to isolate the impact of workflow design. Our findings reveal that state-of-the-art agents can fulfill approximately 50\% of requirements on \bench{}, but their success is critically dependent on the architectural strategy for task decomposition and collaboration. Furthermore, our analysis indicates that the primary bottleneck is the omission of requirements and inadequate self-verification. This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents, guiding future research toward enhancing requirement comprehension and planning.

Related papers

Toward Formalizing LLM-Based Agent Designs through Structural Context Modeling and Semantic Dynamics Analysis [13.919694566467053]
We argue that this fragmentation stems from the absence of an analyzable, self-consistent formal model that enables characterization and comparison of LLM agents.<n>To address this gap, we propose the texttt Structural Context Model, a formal model for analyzing and comparing LLM agents from the perspective of context structure.<n>We demonstrate the effectiveness of the complete framework on dynamic variants of the monkey-banana problem, where agents engineered using our approach achieve up to a 32 percentage points improvement in success rate.
arXiv Detail & Related papers (2026-02-09T05:15:11Z)
Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis [2.903627214446312]
We introduce an architectural taxonomy for systematically comparing multi-agent LLM frameworks along fundamental dimensions.<n>We develop a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline.<n>Our results show that framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%.
arXiv Detail & Related papers (2026-02-03T05:37:56Z)
EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection.<n>However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight.<n>We introduce textscEmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z)
Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey [59.3507264893654]
Issue resolution is a complex Software Engineering task integral to real-world development.<n> benchmarks like SWE-bench revealed this task as profoundly difficult for large language models.<n>This paper presents a systematic survey of this emerging domain.
arXiv Detail & Related papers (2026-01-15T18:55:03Z)
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce textbfLoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language models (LLMs) agents in realistic, long-context software engineering.<n>Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations.<n>Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
A Survey of Vibe Coding with Large Language Models [93.88284590533242]
"Vibe Coding" is a development methodology where developers validate AI-generated implementations through outcome observation.<n>Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored.<n>This survey provides the first comprehensive and systematic review of Vibe Coding with large language models.
arXiv Detail & Related papers (2025-10-14T11:26:56Z)
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of Large Language Models-powered software engineering.<n>We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z)
Agentic AI Reasoning for Mobile Edge General Intelligence: Fundamentals, Approaches, and Directions [74.35421055079655]
Large language models (LLMs) have enabled an emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities.<n>Mobile Edge General Intelligence (MEGI) brings real-time, privacy-preserving reasoning to the network edge.<n>We propose a joint optimization framework for efficient LLM reasoning deployment in MEGI.
arXiv Detail & Related papers (2025-09-27T10:53:48Z)
Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research [32.92036657863354]
Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks.<n>However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison.<n>We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and abstraction framework that addresses these challenges.
arXiv Detail & Related papers (2025-05-30T08:46:23Z)
Assessing LLMs for Front-end Software Architecture Knowledge [0.0]
Large Language Models (LLMs) have demonstrated significant promise in automating software development tasks.<n>This study investigates the capabilities of an LLM in understanding, reproducing, and generating structures within the VIPER architecture.<n> Experimental results, using ChatGPT 4 Turbo 2024-04-09, reveal that the LLM excelled in higher-order tasks like evaluating and creating, but faced challenges with lower-order tasks requiring precise retrieval of architectural details.
arXiv Detail & Related papers (2025-02-26T19:33:35Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.