Artisan: Agentic Artifact Evaluation
- URL: http://arxiv.org/abs/2602.10046v1
- Date: Tue, 10 Feb 2026 18:15:48 GMT
- Title: Artisan: Agentic Artifact Evaluation
- Authors: Doehyun Baek, Michael Pradel
- Abstract summary: Artifact evaluation has become standard practice in the software engineering community to ensure the veracity of research results. We present Artisan, an automated LLM agent for reproducing research results given a paper and its artifact.
- Score: 14.265317773238529
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artifact evaluation has become standard practice in the software engineering community to ensure the reproducibility of research results. However, the current manual process is labor-intensive and is hence done only as a one-time assessment for a subset of all papers. To support the artifact evaluation effort, we present Artisan, an automated LLM agent for reproducing research results given a paper and its artifact. The approach is enabled by two key contributions: First, we frame the reproduction problem as a code generation task where the goal is to generate a reproduction script that, when executed, reproduces the results reported in a paper. Unlike prior work on automatically reproducing research results in other domains, this formulation allows for running the script independently of the agent and for assessing the reproduction process at a fine-grained level. Second, we design an automated judging mechanism that guides the agent toward the expected results without revealing them and that prevents trivial solutions, such as simply copying checked-in results. To evaluate Artisan, we introduce Artisan-Bench, the first benchmark assessing the ability to generate reproduction scripts and the first benchmark for automated artifact evaluation in software engineering. Artisan-Bench comprises 60 tasks derived from 23 software engineering papers, covering different research areas and programming languages. We validate all tasks in Artisan-Bench for reproducibility to ensure that the tasks are feasible. Our experiments show that Artisan is effective, producing 44/60 reproduction scripts and outperforming the best available baseline, a vanilla LLM agent (mini-swe-agent), by 3.14$\times$ in terms of reproduction scripts generated, while taking, on average, $0.45 and 48 minutes per task. Artisan also helped uncover 20 new errors in either the paper or the artifact.
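The abstract's framing can be pictured as two pieces: a standalone reproduction script that emits the metrics it reproduces, and a judge that checks those metrics against the paper's reported numbers without exposing them to the agent. The following Python sketch is a minimal illustration of that framing under assumed conventions (a script that prints its metrics as JSON, a relative-tolerance check); the function names and the protocol are hypothetical, not Artisan's actual interface.

```python
# Hypothetical sketch of "reproduction as code generation": the agent's output
# is an ordinary script, which is executed independently and then judged.
# All names and the JSON-on-stdout convention are illustrative assumptions.
import json
import subprocess


def run_reproduction_script(script_path: str, workdir: str) -> dict:
    """Execute a candidate reproduction script outside the agent and collect
    the metrics it reports as a JSON object on stdout."""
    proc = subprocess.run(
        ["bash", script_path],
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=3600,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"reproduction script failed: {proc.stderr[-500:]}")
    return json.loads(proc.stdout)


def judge(reported: dict, expected: dict, rel_tol: float = 0.05) -> bool:
    """Compare reproduced metrics against the paper's reported numbers.
    Only the judge sees the expected values; the agent receives pass/fail
    feedback, so the targets are never revealed to it."""
    return all(
        key in reported and abs(reported[key] - value) <= rel_tol * abs(value)
        for key, value in expected.items()
    )
```

Because the script is an ordinary executable, it can be rerun long after the agent has finished, which is what enables the agent-independent, fine-grained assessment the abstract describes.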
Related papers
- ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences [19.81372090301296]
ReplicatorBench is an end-to-end benchmark for evaluating AI agents in research replication across three stages. We develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access.
arXiv Detail & Related papers (2026-02-11T20:42:10Z) - Agent-Based Software Artifact Evaluation [15.526715803442746]
Artifact evaluation has been adopted in the Software Engineering (SE) research community for 15 years. We propose ArtifactCopilot, the first end-to-end agent-based framework for automated artifact evaluation.
arXiv Detail & Related papers (2026-02-02T15:41:16Z) - Automated Validation of LLM-based Evaluators for Software Engineering Artifacts [0.7548538278943616]
REFINE (Ranking Evaluators for FIne grained Nuanced Evaluation) is an automated framework for benchmarking large language models (LLMs). REFINE applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality. It quantifies each candidate evaluator configuration by measuring how closely its rankings align with the expected ordering; a minimal sketch of such a ranking-alignment check appears after this list.
arXiv Detail & Related papers (2025-08-04T18:52:01Z) - Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs. This paper investigates how programmers use Large Language Models to complement their own skills. The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking [48.90371827091671]
AutoExperiment is a benchmark that evaluates AI agents' ability to implement and run machine learning experiments. We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution.
arXiv Detail & Related papers (2025-06-24T15:39:20Z) - AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage [62.049868205196425]
AutoReproduce is a framework capable of automatically reproducing experiments described in research papers in an end-to-end manner. Results show that AutoReproduce achieves an average performance gap of 22.1% on 89.74% of the executable experiment runs.
arXiv Detail & Related papers (2025-05-27T03:15:21Z) - FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks [52.47895046206854]
FieldWorkArena is a benchmark for agentic AI targeting real-world field work. This paper defines a new action space that agentic AI should possess for real-world work environment benchmarks.
arXiv Detail & Related papers (2025-05-26T08:21:46Z) - PaperBench: Evaluating AI's Ability to Replicate AI Research [3.4567792239799133]
PaperBench is a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch. PaperBench contains 8,316 individually gradable tasks.
arXiv Detail & Related papers (2025-04-02T15:55:24Z) - Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training in language agents can generate supervision from the agent itself. We present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples.
arXiv Detail & Related papers (2024-06-03T16:21:38Z)
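As referenced in the REFINE entry above, measuring how closely an evaluator's rankings align with an expected ordering is commonly done with a rank-correlation statistic. The sketch below uses Kendall's tau over a set of artifacts with progressively reduced quality; the metric choice, the example data, and all names are illustrative assumptions, not REFINE's actual procedure.

```python
# Hedged illustration of the ranking-alignment idea described for REFINE.
# Kendall's tau is one common agreement measure; the paper may use another.
from scipy.stats import kendalltau

# Artifacts synthesized with progressively reduced quality: index 0 is the
# highest-quality artifact, larger indices are increasingly degraded versions.
expected_order = [0, 1, 2, 3, 4]

# Ranks assigned to the same artifacts by one candidate LLM evaluator
# configuration (hypothetical output).
evaluator_ranks = [0, 2, 1, 3, 4]

# Agreement between the evaluator's ranking and the expected ordering:
# tau == 1.0 means perfect alignment, tau == -1.0 means a fully reversed order.
tau, _ = kendalltau(expected_order, evaluator_ranks)
print(f"ranking alignment (Kendall's tau): {tau:.2f}")
```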