E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
- URL: http://arxiv.org/abs/2510.14509v2
- Date: Fri, 24 Oct 2025 07:13:11 GMT
- Title: E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
- Authors: Jingyao Liu, Chen Huang, Zhizhao Guan, Wenqiang Lei, Yang Deng
- Abstract summary: We present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement in large language models (LLMs) has demonstrated significant potential in End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD), which evaluates the capabilities of E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.
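To illustrate the BDD pattern the benchmark builds on, here is a minimal, self-contained sketch of Behave-style step matching in plain Python. The `ScenarioRunner` class, the calculator scenario, and all step names are hypothetical illustrations of the general technique, not code from E2EDev or from the Behave library itself (Behave's actual API uses `@given`/`@when`/`@then` decorators and `.feature` files).

```python
import re
from types import SimpleNamespace

class ScenarioRunner:
    """Minimal stand-in for the step matching that Behave automates."""
    def __init__(self):
        self.steps = []  # (compiled pattern, handler) pairs

    def step(self, pattern):
        """Register a step implementation for a natural-language step."""
        def register(fn):
            self.steps.append((re.compile(pattern), fn))
            return fn
        return register

    def run(self, lines):
        """Match each scenario line to a step and execute it in order."""
        context = SimpleNamespace()  # shared state, like Behave's `context`
        for line in lines:
            for pattern, fn in self.steps:
                match = pattern.fullmatch(line)
                if match:
                    fn(context, *match.groups())
                    break
            else:
                raise AssertionError(f"no step implementation for: {line}")
        return context

runner = ScenarioRunner()

@runner.step(r'the calculator shows "(\d+)"')
def given_shows(context, value):
    context.display = int(value)

@runner.step(r'the user adds "(\d+)"')
def when_adds(context, value):
    context.display += int(value)

@runner.step(r'the display should read "(\d+)"')
def then_reads(context, value):
    assert context.display == int(value), context.display

# A scenario written as Given/When/Then-style natural-language steps.
scenario = [
    'the calculator shows "2"',
    'the user adds "3"',
    'the display should read "5"',
]
ctx = runner.run(scenario)
print(ctx.display)  # 5
```

In E2EDev's setting, the "Then" steps are where generated software either passes or fails: the step implementations mimic real user interactions and assert on observable behavior, so the whole pipeline can run unattended.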
Related papers
- APEX-SWE [4.927317067589892]
We introduce the AI Productivity Index for Software Engineering (APEX-SWE). APEX-SWE is a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%.
arXiv Detail & Related papers (2026-01-13T18:44:08Z)
- GenIA-E2ETest: A Generative AI-Based Approach for End-to-End Test Automation [0.3499870393443268]
This paper introduces GenIA-E2ETest, which leverages generative AI to automatically generate E2E test scripts from natural language descriptions. We evaluated the approach on two web applications, assessing completeness, correctness, adaptation effort, and robustness.
arXiv Detail & Related papers (2025-10-01T15:30:24Z)
- OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning [50.45036742963495]
We introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning. A Task-Adaptive 3D Grounding mechanism enables context-aware 3D grounding for diverse embodied tasks. An Embodiment-Aware Reasoning framework incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable.
arXiv Detail & Related papers (2025-09-11T10:32:22Z)
- InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling [71.37579508777843]
Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments.
arXiv Detail & Related papers (2025-08-12T05:00:00Z)
- SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner [53.54568352375669]
We introduce **SWE-Flow**, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, **SWE-Flow** automatically infers incremental development steps directly from unit tests. We generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the **SWE-Flow-Eval** benchmark.
arXiv Detail & Related papers (2025-06-10T17:23:33Z)
- Feature-Driven End-To-End Test Generation [5.7340627516257525]
AutoE2E is a novel approach that automates the generation of semantically meaningful, feature-driven E2E test cases for web applications. E2EBench is a new benchmark for automatically assessing the feature coverage of E2E test suites.
arXiv Detail & Related papers (2024-08-04T01:16:04Z)
- R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models [51.468732121824125]
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems.
Existing evaluation tools only provide a few baselines and evaluate them on various domains without probing the depth of domain knowledge.
In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAGs.
arXiv Detail & Related papers (2024-06-17T15:59:49Z)
- Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving [59.705635382104454]
We present Bench2Drive, the first benchmark for evaluating E2E-AD systems' multiple abilities in a closed-loop manner. We implement state-of-the-art E2E-AD models and evaluate them in Bench2Drive, providing insights regarding current status and future directions.
arXiv Detail & Related papers (2024-06-06T09:12:30Z)
- EmoBench: Evaluating the Emotional Intelligence of Large Language Models [73.60839120040887]
EmoBench is a benchmark that draws upon established psychological theories and proposes a comprehensive definition for machine Emotional Intelligence (EI).
EmoBench includes a set of 400 hand-crafted questions in English and Chinese, which are meticulously designed to require thorough reasoning and understanding.
Our findings reveal a considerable gap between the EI of existing Large Language Models and the average human, highlighting a promising direction for future research.
arXiv Detail & Related papers (2024-02-19T11:48:09Z)
- E2E-AT: A Unified Framework for Tackling Uncertainty in Task-aware End-to-end Learning [9.741277008050927]
We propose a unified framework that covers the uncertainties emerging in both the input feature space of the machine learning models and the constrained optimization models.
We show that neglecting the uncertainty of COs during training introduces a new source of generalization error.
The framework is described as a robust optimization problem and is practically solved via end-to-end adversarial training (E2E-AT).
arXiv Detail & Related papers (2023-12-17T02:23:25Z)
- E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation [95.49128988683191]
Sequence-to-sequence (seq2seq) learning is a popular paradigm for large-scale pretraining of language models.
We propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2.
E2S2 improves seq2seq models by integrating more efficient self-supervised information into the encoders.
arXiv Detail & Related papers (2022-05-30T08:25:36Z)
- Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI [67.13999010060057]
We propose a novel approach to integrate the LF-MMI criterion into E2E ASR frameworks in both the training and decoding stages.
Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements.
arXiv Detail & Related papers (2021-12-05T07:30:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.