HerAgent: Rethinking the Automated Environment Deployment via Hierarchical Test Pyramid
- URL: http://arxiv.org/abs/2602.07871v2
- Date: Fri, 13 Feb 2026 09:24:18 GMT
- Title: HerAgent: Rethinking the Automated Environment Deployment via Hierarchical Test Pyramid
- Authors: Xiang Li, Siyu Lu, Federica Sarro, Claire Le Goues, He Ye
- Abstract summary: We argue that environment setup success should be evaluated through executable evidence rather than a single binary signal. We propose HerAgent, an automated environment setup approach that incrementally constructs executable environments.
- Score: 15.944450159856602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated software environment setup is a prerequisite for testing, debugging, and reproducing failures, yet remains challenging in practice due to complex dependencies, heterogeneous build systems, and incomplete documentation. Recent work leverages large language models to automate this process, but typically evaluates success using weak signals such as dependency installation or partial test execution, which do not ensure that a project can actually run. In this paper, we argue that environment setup success should be evaluated through executable evidence rather than a single binary signal. We introduce the Environment Maturity Hierarchy, which defines three success levels based on progressively stronger execution requirements, culminating in successful execution of a project's main entry point. Guided by this hierarchy, we propose HerAgent, an automated environment setup approach that incrementally constructs executable environments through execution-based validation and repair. We evaluate HerAgent on four public benchmarks, where it outperforms all related work, achieving up to 79.6% improvement due to its holistic understanding of project structure and dependencies. On complex C/C++ projects, HerAgent surpasses prior approaches by 66.7%. In addition, HerAgent uniquely resolves 11-30 environment instances across the benchmarks that no prior method can configure.
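The Environment Maturity Hierarchy described above can be sketched as a cumulative classifier over execution evidence. The level names and the exact ordering below are illustrative assumptions, not the paper's definitions; the paper only states that there are three progressively stronger levels ending at the main entry point.

```python
# Hypothetical sketch of the Environment Maturity Hierarchy: three success
# levels with progressively stronger execution requirements. Level names
# and cumulativity are assumptions for illustration.
from enum import IntEnum


class MaturityLevel(IntEnum):
    NONE = 0              # no usable environment
    DEPS_INSTALLED = 1    # dependencies resolve and install
    TESTS_RUN = 2         # the test suite executes
    ENTRY_POINT_RUNS = 3  # the project's main entry point executes


def classify(deps_ok: bool, tests_ok: bool, entry_ok: bool) -> MaturityLevel:
    """Return the highest level whose execution evidence is satisfied.

    Levels are treated as cumulative: a later level only counts if all
    earlier levels are also satisfied.
    """
    level = MaturityLevel.NONE
    if deps_ok:
        level = MaturityLevel.DEPS_INSTALLED
        if tests_ok:
            level = MaturityLevel.TESTS_RUN
            if entry_ok:
                level = MaturityLevel.ENTRY_POINT_RUNS
    return level
```

Under this reading, a repository whose tests pass but whose entry point crashes would stop at level 2, which is precisely the kind of distinction a single binary success signal cannot express.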
Related papers
- EvoConfig: Self-Evolving Multi-Agent Systems for Efficient Autonomous Environment Configuration [44.95469898974659]
EvoConfig is an efficient environment configuration framework that optimizes multi-agent collaboration to build correct runtime environments. It features an expert diagnosis module for fine-grained post-execution analysis, and a self-evolving mechanism that lets expert agents provide self-feedback and dynamically adjust error-fixing priorities. EvoConfig matches the previous state-of-the-art Repo2Run on Repo2Run's 420 repositories, while delivering clear gains on harder cases.
arXiv Detail & Related papers (2026-01-23T06:33:01Z) - ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z) - What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks. We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment.
arXiv Detail & Related papers (2026-01-14T14:09:11Z) - Multi-Docker-Eval: A `Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering [38.724704918577295]
The Multi-Docker-Eval benchmark includes 40 real-world repositories spanning 9 programming languages. The overall success rate of current models is low (F2P at most 37.7%), with environment construction being the primary bottleneck. These findings provide actionable guidelines for building scalable, fully automated SWE pipelines.
arXiv Detail & Related papers (2025-12-07T16:43:45Z) - Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents [71.85020581835042]
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup planning.
arXiv Detail & Related papers (2025-10-29T16:59:07Z) - PIPer: On-Device Environment Setup via Online Reinforcement Learning [74.52354321028493]
Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. Recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. We combine supervised fine-tuning for generating correct scripts with Reinforcement Learning with Verifiable Rewards (RLVR) to adapt the model to the task of environment setup. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with larger models (Qwen3-32B and GPT-4).
arXiv Detail & Related papers (2025-09-29T20:03:05Z) - SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments [2.184775414778289]
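The "verifiable reward" idea in the PIPer summary above can be sketched as a binary score derived from actually executing the generated setup script. The function below is an illustrative assumption, not PIPer's implementation: it writes the script to a temporary file, runs it, and rewards only a clean exit.

```python
# Hypothetical sketch of a verifiable reward for environment-setup scripts:
# reward 1.0 only if the generated shell script runs to completion.
# The scoring scheme and timeout are illustrative assumptions.
import os
import subprocess
import tempfile


def verifiable_reward(script_text: str, timeout: int = 300) -> float:
    """Execute a candidate setup script and return a binary reward."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_text)
        path = f.name
    try:
        result = subprocess.run(["bash", path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # a hanging script earns no reward
    finally:
        os.unlink(path)
```

Because the reward is computed from execution rather than from a learned judge, it is cheap to verify and hard to game, which is the property RLVR relies on.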
We introduce SetupBench, a benchmark that isolates the environment-bootstrap skill. Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios. We find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%).
arXiv Detail & Related papers (2025-07-11T22:45:07Z) - EnvBench: A Benchmark for Automated Environment Setup [76.02998475135824]
Large Language Models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
arXiv Detail & Related papers (2025-03-18T17:19:12Z) - PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z) - Repo2Run: Automated Building Executable Environment for Code Repository at Scale [10.143091612327602]
We introduce Repo2Run, an agent that automates building executable test environments for arbitrary repositories at scale. Repo2Run iteratively builds the Docker image, runs unit tests, and synthesizes the Dockerfile based on build feedback. The resulting Dockerfile can then be used to create Docker container environments for running code and tests.
arXiv Detail & Related papers (2025-02-19T12:51:35Z)
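The build-test-repair feedback cycle that the Repo2Run summary describes can be sketched as a loop that keeps revising a Dockerfile until the build and tests succeed. The function names and the injected `build`/`run_tests`/`repair` callables below are hypothetical, not Repo2Run's actual API; in practice the repair step would be an LLM agent consuming the error log.

```python
# Illustrative sketch of an iterative build-and-repair loop in the spirit
# of Repo2Run. The three callables are injected so the control flow can be
# shown (and tested) without Docker; each returns (success, log).
from typing import Callable, Optional, Tuple

Step = Callable[[str], Tuple[bool, str]]


def iterate_until_green(
    build: Step,
    run_tests: Step,
    repair: Callable[[str, str], str],
    dockerfile: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Build and test repeatedly; on failure, let the repair step revise
    the Dockerfile using the error log. Returns a working Dockerfile, or
    None if no fix is found within max_rounds."""
    for _ in range(max_rounds):
        ok, log = build(dockerfile)
        if ok:
            ok, log = run_tests(dockerfile)
            if ok:
                return dockerfile
        dockerfile = repair(dockerfile, log)
    return None
```

Separating the repair policy from the execution loop keeps the expensive, nondeterministic LLM step swappable while the validation logic stays fixed.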
This list is automatically generated from the titles and abstracts of the papers in this site.