Agent-Based Software Artifact Evaluation
- URL: http://arxiv.org/abs/2602.02235v2
- Date: Tue, 03 Feb 2026 16:08:28 GMT
- Title: Agent-Based Software Artifact Evaluation
- Authors: Zhaonan Wu, Yanjie Zhao, Zhenpeng Chen, Zheng Wang, Haoyu Wang
- Abstract summary: Artifact evaluation has been adopted in the Software Engineering (SE) research community for 15 years. We propose ArtifactCopilot, the first end-to-end agent-based framework for automated artifact evaluation.
- Score: 15.526715803442746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artifact evaluation has been adopted in the Software Engineering (SE) research community for 15 years, substantially improving research reproducibility across major SE conferences. However, this success has introduced a growing scalability challenge, as artifact evaluation relies heavily on reviewers' manual execution and debugging, leading to escalating human effort amid rapidly increasing paper submissions. To address this problem, we investigate automated artifact evaluation. We first conduct a preliminary study on artifacts from top-tier SE conferences and identify three key challenges: perceiving execution states, maintaining stable execution environments, and recovering from execution errors. Inspired by these findings, we propose ArtifactCopilot, the first end-to-end agent-based framework for automated artifact evaluation. ArtifactCopilot automates environment construction, instruction execution, and error recovery by combining an execution normalization strategy to ensure environment stability with an artifact evaluation graph that transforms README documents into dependency-aware command graphs, enabling structured execution planning, execution-state tracking, and error recovery. Evaluation on 48 real-world artifacts shows that ArtifactCopilot matches human artifact evaluation outcomes for 85.42% of the artifacts, outperforming Claude Code by 52.09 percentage points, while costing only $0.091 per artifact on average and requiring zero human intervention for 45 out of 48 artifacts.
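To make the artifact-evaluation-graph idea concrete, here is a minimal sketch, not the authors' implementation: shell commands extracted from a README become nodes of a dependency-aware graph, which is executed in dependency order with per-command state tracking and a simple retry hook standing in for agentic error recovery. The names (CommandNode, build_graph, run_graph) and the linear-dependency heuristic are assumptions made for illustration only.

```python
import re
import subprocess
from dataclasses import dataclass, field

FENCE = "`" * 3  # markdown code-fence marker, built here to keep the example readable


@dataclass
class CommandNode:
    cmd: str                                   # shell command extracted from the README
    deps: list = field(default_factory=list)   # indices of prerequisite nodes
    state: str = "pending"                     # pending | succeeded | failed | skipped


def build_graph(readme_text: str) -> list[CommandNode]:
    """Extract fenced shell blocks and chain the commands sequentially,
    so each command depends on the previous one (a simple linear graph)."""
    pattern = FENCE + r"(?:bash|sh|shell)?\n(.*?)" + FENCE
    blocks = re.findall(pattern, readme_text, re.DOTALL)
    commands = [line.strip() for block in blocks
                for line in block.splitlines() if line.strip()]
    return [CommandNode(cmd, deps=[i - 1] if i > 0 else [])
            for i, cmd in enumerate(commands)]


def run_graph(nodes: list[CommandNode], max_retries: int = 1) -> bool:
    """Run each node whose prerequisites succeeded; retry failures once.
    An agent would instead inspect stderr here and patch the command or environment."""
    for node in nodes:
        if any(nodes[d].state != "succeeded" for d in node.deps):
            node.state = "skipped"  # a prerequisite failed, so downstream work is skipped
            continue
        for _ in range(max_retries + 1):
            result = subprocess.run(node.cmd, shell=True, capture_output=True, text=True)
            if result.returncode == 0:
                node.state = "succeeded"
                break
            node.state = "failed"   # hook point for LLM-driven error recovery
    return all(n.state == "succeeded" for n in nodes)


if __name__ == "__main__":
    readme = (f"## Build\n{FENCE}bash\necho building\n{FENCE}\n"
              f"## Test\n{FENCE}bash\necho running tests\n{FENCE}\n")
    graph = build_graph(readme)
    print("reproduced" if run_graph(graph) else "needs manual review")
```

In ArtifactCopilot the recovery step is agent-driven rather than a blind retry, and the evaluation graph captures richer dependencies than the sequential chain assumed above; the sketch only illustrates the planning/state-tracking structure the abstract describes.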
Related papers
- ResearchGym: Evaluating Language Model Agents on Real-World AI Research [48.46915933681714]
We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap.
arXiv Detail & Related papers (2026-02-16T19:00:03Z) - Artisan: Agentic Artifact Evaluation [14.265317773238529]
Artifact evaluation has become standard practice in the software engineering community to ensure the veracity of research results. We present Artisan, an automated LLM agent for reproducing research results given a paper and its artifact.
arXiv Detail & Related papers (2026-02-10T18:15:48Z) - The State of Open Science in Software Engineering Research: A Case Study of ICSE Artifacts [2.5705703401045557]
There is a marked lack of studies that comprehensively examine the executability and rigor of replication packages in software engineering (SE) research. We evaluate 100 replication packages published as part of ICSE proceedings over the past decade. Our findings reveal that only 40% of the 100 artifacts evaluated were executable, of which 32.5% (13 out of 40) ran without any modification.
arXiv Detail & Related papers (2026-01-05T12:47:43Z) - Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark [55.41250396114216]
We review human evaluation practices in automated, speech-driven 3D gesture generation. We introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset.
arXiv Detail & Related papers (2025-11-03T05:17:28Z) - Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains [97.5573252172065]
We train a family of Automatic Reasoning Evaluators (FARE) with a simple iterative rejection-sampling supervised finetuning approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators. As an inference-time reranker, FARE-20B achieves near-oracle performance on MATH.
arXiv Detail & Related papers (2025-10-20T17:52:06Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks [15.820416019287622]
SE-Jury is the first evaluation metric for LLM-as-Ensemble-Judge. We evaluate SE-Jury across a diverse set of software engineering (SE) benchmarks.
arXiv Detail & Related papers (2025-05-27T08:04:34Z) - ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [0.0]
We introduce ToolComp, a benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators. We generate synthetic training data to compare the performance of outcome-supervised reward models with process-supervised reward models.
arXiv Detail & Related papers (2025-01-02T15:10:52Z) - TRIAD: Automated Traceability Recovery based on Biterm-enhanced Deduction of Transitive Links among Artifacts [53.92293118080274]
Traceability allows stakeholders to extract and comprehend the trace links among software artifacts introduced across the software life cycle.
Most existing approaches rely on textual similarities among software artifacts, such as those based on Information Retrieval (IR); a minimal sketch of this IR-style baseline appears after this list.
arXiv Detail & Related papers (2023-12-28T06:44:24Z) - Re-Evaluating LiDAR Scene Flow for Autonomous Driving [80.37947791534985]
Popular benchmarks for self-supervised LiDAR scene flow have unrealistic rates of dynamic motion, unrealistic correspondences, and unrealistic sampling patterns.
We evaluate a suite of top methods on real-world datasets.
We show that despite the emphasis placed on learning, most performance gains are caused by pre- and post-processing steps.
arXiv Detail & Related papers (2023-04-04T22:45:50Z) - Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow [22.540665278228975]
We propose VAIDA, a novel benchmark creation paradigm for NLP.
VAIDA focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies.
We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts.
arXiv Detail & Related papers (2023-02-09T04:43:10Z)
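As context for the TRIAD entry above, the following is a minimal sketch of the classic IR-style textual-similarity baseline that trace-link recovery methods build on: candidate links are ranked by TF-IDF cosine similarity between artifact texts. The example artifacts are invented for illustration, and this is not TRIAD's biterm-enhanced deduction algorithm, only the kind of baseline it improves upon.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical source artifacts (requirements) and target artifacts (code summaries).
requirements = [
    "Passwords must be hashed before the password is stored.",
    "Monthly reports shall be exported as PDF files.",
]
code_docs = [
    "Hashes the password with bcrypt before it is stored in the database.",
    "Renders the monthly report and exports it as a PDF file.",
]

# Fit one vocabulary over both artifact sets so the vectors are comparable.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(requirements + code_docs)
req_vecs, code_vecs = matrix[: len(requirements)], matrix[len(requirements):]

# Candidate trace links: cosine similarity of every requirement to every code artifact.
scores = cosine_similarity(req_vecs, code_vecs)
for i, req in enumerate(requirements):
    best = scores[i].argmax()
    print(f"{req} -> {code_docs[best]} (similarity={scores[i][best]:.2f})")
```

Pure lexical matching misses links whose texts share little vocabulary (e.g., "report" vs. "reports" without stemming), which is one weakness that richer methods such as TRIAD aim to address.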