Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
- URL: http://arxiv.org/abs/2510.25694v1
- Date: Wed, 29 Oct 2025 16:59:07 GMT
- Title: Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
- Authors: Jiayi Kuang, Yinghui Li, Xin Zhang, Yangning Li, Di Yin, Xing Sun, Ying Shen, Philip S. Yu
- Abstract summary: Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and execution of the final configuration.
- Score: 71.85020581835042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action to execute the final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
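The abstract's key idea, scoring each stage of a setup trajectory (planning, error perception, repair, final execution) separately rather than reporting a single pass/fail, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the stage names follow the abstract, but the data schema and aggregation rules are assumptions.

```python
from dataclasses import dataclass, field

# Stages of an environment-setup trajectory, following the abstract's
# decomposition. Binary per-stage outcomes are an assumed simplification.
STAGES = ("planning", "perception", "repair", "execution")

@dataclass
class Trajectory:
    """Per-stage outcomes for one task instance (hypothetical schema)."""
    outcomes: dict = field(default_factory=dict)  # stage name -> bool

def process_level_report(trajectories):
    """Compute per-stage success rates plus end-to-end executability.

    End-to-end success requires every stage to succeed, which is why an
    aggregate build/test rate alone obscures *where* agents fail.
    """
    n = len(trajectories)
    per_stage = {
        s: sum(t.outcomes.get(s, False) for t in trajectories) / n
        for s in STAGES
    }
    end_to_end = sum(
        all(t.outcomes.get(s, False) for s in STAGES) for t in trajectories
    ) / n
    return per_stage, end_to_end

# Two illustrative trajectories: one localizes the error but fails to
# repair it (the failure mode the abstract highlights), one succeeds.
trajs = [
    Trajectory({"planning": True, "perception": True,
                "repair": False, "execution": False}),
    Trajectory({"planning": True, "perception": True,
                "repair": True, "execution": True}),
]
per_stage, e2e = process_level_report(trajs)
```

Here the per-stage view shows perfect planning and perception but only 50% repair success, information a single end-to-end rate of 50% would hide.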
Related papers
- AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition [72.24180896265192]
We introduce AgentNoiseBench, a framework for evaluating the robustness of agentic models in noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios, then categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks.
arXiv Detail & Related papers (2026-02-11T20:33:10Z) - HerAgent: Rethinking the Automated Environment Deployment via Hierarchical Test Pyramid [15.944450159856602]
We argue that environment setup success should be evaluated through executable evidence rather than a single binary signal. We propose HerAgent, an automated environment setup approach that incrementally constructs executable environments.
arXiv Detail & Related papers (2026-02-08T08:57:05Z) - Environment-Aware Code Generation: How far are We? [52.69113158357018]
It is unclear whether large language models (LLMs) can reliably generate executable code tailored to a user's specific environment. We present the first systematic study of Environment-Aware Code Generation (EACG), where generated code must be functionally correct and directly executable under arbitrary software configurations. Our results show that current LLMs struggle with environment-specific code generation, while our adaptations improve environment compatibility and executability.
arXiv Detail & Related papers (2026-01-18T04:58:15Z) - ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z) - OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models [54.44308299945632]
We introduce OS-Oracle, which makes three core contributions: a scalable data pipeline for cross-platform GUI critic data; a two-stage training paradigm combining supervised fine-tuning and consistency-preserving group relative policy optimization; and OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms. The resulting critic model, OS-Oracle-7B, achieves state-of-the-art performance among open-source VLMs on OS-Critic Bench and surpasses proprietary models in the mobile domain.
arXiv Detail & Related papers (2025-12-18T08:29:50Z) - Multi-Docker-Eval: A `Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering [38.724704918577295]
The Multi-Docker-Eval benchmark includes 40 real-world repositories spanning 9 programming languages. The overall success rate of current models is low (F2P at most 37.7%), with environment construction being the primary bottleneck. These findings provide actionable guidelines for building scalable, fully automated SWE pipelines.
arXiv Detail & Related papers (2025-12-07T16:43:45Z) - Detecting Pipeline Failures through Fine-Grained Analysis of Web Agents [0.48156730450374763]
This work analyzes existing benchmarks and highlights the lack of fine-grained diagnostic tools. We propose a modular evaluation framework that decomposes agent pipelines into interpretable stages for detailed error analysis.
arXiv Detail & Related papers (2025-09-17T19:34:49Z) - Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs). This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z) - Automated Validation of LLM-based Evaluators for Software Engineering Artifacts [0.7548538278943616]
REFINE (Ranking Evaluators for FIne grained Nuanced Evaluation) is an automated framework for benchmarking large language models (LLMs). REFINE applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality, and quantifies each candidate evaluator configuration by measuring how closely its rankings align with the expected ordering.
arXiv Detail & Related papers (2025-08-04T18:52:01Z) - MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models [76.72220653705679]
We introduce MCPEval, an open-source framework that automates end-to-end task generation and deep evaluation of intelligent agents. MCPEval standardizes metrics, integrates seamlessly with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance.
arXiv Detail & Related papers (2025-07-17T05:46:27Z) - SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments [2.184775414778289]
We introduce SetupBench, a benchmark that isolates the environment-bootstrap skill. Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios. We find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%).
arXiv Detail & Related papers (2025-07-11T22:45:07Z) - SWE-bench Goes Live! [39.295587503671015]
We present SWE-bench-Live, a live-updatable benchmark for large language models (LLMs). Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup.
arXiv Detail & Related papers (2025-05-29T13:09:44Z) - FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks [52.47895046206854]
FieldWorkArena is a benchmark for agentic AI targeting real-world field work. This paper defines a new action space that agentic AI should possess for real-world work-environment benchmarks.
arXiv Detail & Related papers (2025-05-26T08:21:46Z) - On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present RepoExec, a novel benchmark designed to evaluate repository-level code generation. We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.