The Last Human-Written Paper: Agent-Native Research Artifacts
Abstract Overview
This paper introduces Ara (Agent-Native Research Artifact), a protocol that replaces narrative research papers with a machine-executable package organized into four layers: scientific logic (/logic), executable code (/src), an exploration graph preserving failed and successful research trajectories (/trace), and grounded evidence (/evidence). The authors argue that conventional papers impose a "Storytelling Tax" (discarding failed experiments and branching research processes) and an "Engineering Tax" (omitting execution-critical details such as hyperparameters and configurations). Three supporting mechanisms are presented: a Live Research Manager that captures decisions during researcher–agent coding sessions, an Ara Compiler that converts legacy PDFs and repositories into Ara format, and a three-level ARA Seal review system for machine-verifiable structural, rigor, and reproducibility checks. The protocol is evaluated on knowledge extraction, reproduction, and extension tasks using PaperBench and RE-Bench sources, restricted to the machine learning domain.
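To make the four-layer layout concrete, here is a minimal sketch of a structural check on a package directory. The layer names come from the summary above; the function name `missing_layers` and the assumption that each layer is a top-level directory are illustrative only and are not taken from the paper.

```python
from pathlib import Path

# Layer names as listed in the summary. Treating each as a top-level
# directory of the package is an assumption made for this sketch.
REQUIRED_LAYERS = ("logic", "src", "trace", "evidence")


def missing_layers(package_root):
    """Return the required layer names that are absent under package_root."""
    root = Path(package_root)
    return [name for name in REQUIRED_LAYERS if not (root / name).is_dir()]
```

A package containing only `logic/` and `src/` would be reported as missing `trace` and `evidence`, which is the kind of structural gap a first-level review check could plausibly flag.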
Novelty
The main novelty is reframing the primary research output as an agent-operable, four-layer filesystem artifact, rather than a human-oriented narrative. Explicit cross-layer bindings link claims, code, evidence, and research trajectories, dead ends included. The work also couples this protocol with a live capture mechanism for researcher–agent sessions, a compiler for backward-compatible conversion of legacy papers, and a staged machine-verifiable review pipeline (the ARA Seal).
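One way to picture cross-layer bindings is as a record that ties each claim to the code, evidence, and trace nodes that support it. The sketch below is an assumption-laden illustration: the `Claim` class, its field names, and both helper functions are hypothetical and do not reflect the paper's actual schema or review checks.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical binding record; field names are illustrative only."""
    claim_id: str
    text: str
    code_paths: list = field(default_factory=list)    # files under /src
    evidence_ids: list = field(default_factory=list)  # items under /evidence
    trace_nodes: list = field(default_factory=list)   # nodes under /trace


def unbound_claims(claims, known_evidence_ids):
    """IDs of claims with no evidence binding, or bound to unknown evidence."""
    known = set(known_evidence_ids)
    return [
        c.claim_id
        for c in claims
        if not c.evidence_ids or not set(c.evidence_ids) <= known
    ]


def orphan_evidence(claims, known_evidence_ids):
    """Evidence IDs that no claim refers to, in sorted order."""
    cited = {e for c in claims for e in c.evidence_ids}
    return sorted(set(known_evidence_ids) - cited)


claims = [
    Claim("C1", "Method A beats the baseline", evidence_ids=["E1"]),
    Claim("C2", "Method A scales linearly"),  # no evidence bound
]
print(unbound_claims(claims, ["E1", "E2"]))   # ['C2']
print(orphan_evidence(claims, ["E1", "E2"]))  # ['E2']
```

Keeping the bindings as explicit IDs, rather than prose references, is what would let checks like these run mechanically.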
Results
In knowledge extraction (450 questions across 30 targets), agents using Ara achieved 93.7% accuracy versus 72.4% for the PDF-plus-repository baseline, with the largest gains on failure-knowledge questions (+65.7 pp) and configuration-detail recovery (+24.8 pp). In reproduction across 15 papers (150 subtasks, 1,743 rubric requirements), Ara reached a difficulty-weighted success rate of 64.4% versus 57.4% for the baseline, and the advantage widened on harder subtasks (+8.5 pp on hard). In extension on five RE-Bench tasks under Sonnet 4.6, Ara led to earlier useful progress on all five tasks and better final scores on three of the five. Separately, the review mutation benchmark showed 100% detection of fabricated claims, rebutted-branch leaks, and over-claims, but only 22% detection of orphan experiments.
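The overall gaps implied by these figures can be checked with simple arithmetic. The helper name `pp_gap` is made up for this sketch; the input percentages are the ones reported above.

```python
def pp_gap(ara_pct, baseline_pct):
    """Percentage-point difference between two rates, rounded to one decimal."""
    return round(ara_pct - baseline_pct, 1)


print(pp_gap(93.7, 72.4))  # 21.3 (knowledge extraction accuracy)
print(pp_gap(64.4, 57.4))  # 7.0  (difficulty-weighted reproduction success)
```

These are absolute percentage-point differences, not relative improvements.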
Key Points
- Ara structures research into four linked layers—scientific logic, executable code, exploration graph (including dead ends), and grounded evidence—connected by cross-layer bindings, to preserve information that narrative papers typically flatten or omit.
- The ecosystem includes a Live Research Manager for zero-overhead capture during AI-native development, a Compiler for converting legacy PDFs and repositories, and a three-level ARA Seal review pipeline that automates structural, rigor, and reproducibility verification before human review.
- Empirical evaluation on ML papers shows Ara improves agent accuracy on knowledge extraction (+21.3 pp overall), increases difficulty-weighted reproduction success rates (+7.0 pp, growing with task difficulty), and accelerates early-stage extension work, though late-phase reversals on two of five extension tasks suggest the trace's value depends on the gap between documented strategies and the agent's own discovery capacity.
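
The exploration graph mentioned in the key points can be pictured as a set of nodes with parent links and an outcome label, so that failed branches stay queryable rather than being dropped. The sketch below is illustrative only: `TraceNode`, its fields, and the outcome labels `"success"` and `"failed"` are assumptions, not the paper's format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceNode:
    """Hypothetical trace node; the real /trace schema may differ."""
    node_id: str
    parent: Optional[str]
    summary: str
    outcome: str  # assumed labels: "success" or "failed"


def dead_ends(nodes):
    """IDs of nodes recorded as failed attempts."""
    return [n.node_id for n in nodes if n.outcome == "failed"]


def path_to(nodes, node_id):
    """Node IDs from the root down to node_id, following parent links."""
    by_id = {n.node_id: n for n in nodes}
    path, seen = [], set()
    current = by_id.get(node_id)
    while current is not None and current.node_id not in seen:
        seen.add(current.node_id)  # guard against accidental cycles
        path.append(current.node_id)
        current = by_id.get(current.parent) if current.parent else None
    return list(reversed(path))


nodes = [
    TraceNode("n0", None, "baseline run", "success"),
    TraceNode("n1", "n0", "larger learning rate, diverged", "failed"),
    TraceNode("n2", "n0", "added warmup schedule", "success"),
]
print(dead_ends(nodes))      # ['n1']
print(path_to(nodes, "n2"))  # ['n0', 'n2']
```

An agent extending the work could consult `dead_ends` before retrying a strategy that the original researchers had already ruled out.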