Summary
This week's theme centers on how LLM-based research agents should be assessed and scaffolded as they move beyond writing support into research planning, experimentation, review, and publication workflows. The central issue is not only generating research-like artifacts but ensuring rigor, reproducibility, and trustworthy evaluation through structured review loops, experimentation frameworks, and rubric-based feedback.
Situation
The representative introductions describe a field shifting from prompt-based assistance toward agentic systems that can propose studies, run experiments, write papers, and participate in peer review. At the same time, they argue that open-ended scientific work is hard to evaluate with the simple executable feedback used in math or coding tasks: experimentation requires controlled procedures and documentation, research plans need domain-aware criteria, and AI-generated papers still face fragmented publication venues and uneven quality control.
As a result, current work is converging on infrastructure for rigor rather than raw generation alone. The papers emphasize closed-loop review and refinement, explicit experimental control modules, benchmark tasks derived from real research problems, and rubric-guided grading of research plans as ways to make AI co-scientists more reliable and interpretable while keeping human judgment central for novelty, scientific value, and accountability.
Infographic (English)

Progress
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
AutoResearchClaw introduces a multi-agent autonomous research pipeline evaluated on the 25-topic ARC-Bench, reporting a 54.7% gain over AI Scientist v2. It adds a concrete experiment-stage benchmark rather than relying on general claims about research automation.
How Far Are We From True Auto-Research?
ResearchArena tests whether commercial agents can run a full research loop with minimal scaffolding across 13 CS seed topics. It provides a direct quality finding: 117 agent-generated papers still do not meet top-tier venue acceptance standards, grounding the gap between generation capability and scientific rigor.
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators
Sibyl-AutoResearch builds a self-evolving framework around scientific trial-and-error harnesses for bounded, inspectable autonomous research. It emphasizes retaining both positive and negative results alongside exposed state, memory, gates, and artifact traces for downstream verification and repair.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
FML-bench provides a controlled benchmark of 18 ML research tasks across 10 domains, separating agent strategy from execution infrastructure. It defines 12 process-level behavioral metrics, shifting evaluation toward search dynamics rather than final outputs alone.
AI for Auto-Research: Roadmap & User Guide
AI for Auto-Research surveys AI reliability across the full research lifecycle, identifying stage-dependent boundaries between trustworthy assistance and unreliable autonomy. It argues that evaluation must be stage-specific rather than a single overall score, since artifact generation consistently outpaces verification.
Outlook
Outlook Summary
Near-term progress is likely to focus on stage-specific evaluation for LLM co-scientists. Instead of judging systems by polished papers alone, new benchmarks and lab-style harnesses will test planning, experimentation, and revision cycles through executable artifacts, provenance traces, and process metrics. This direction is reinforced by ARC-Bench, FML-bench, and full-loop testing, which show that lightly scaffolded agents still fall short of publication-quality work. A second direction is to turn plans, experiments, and reviews into training signal for more adaptive research assistants. Richer rubric feedback, language-based critique, and knowledge reuse should help systems improve, but longer-horizon work will still need structured oversight by humans.
Infographic (English)

Three-Year Movement
The first scenario follows directly from the current outlook. Its mechanism is process assurance: trust comes from checking the research workflow, not only the final manuscript. The analogy is HACCP (Hazard Analysis and Critical Control Points), in which food safety depends on monitoring critical control points in production rather than inspecting only the finished product. In AI-assisted research, those control points include plans, tool calls, and review trails.
In the first year, the main movement is toward separate scores for planning, experiment control, and revision quality. Labs would use these systems first in bounded workflows, such as reproducibility checks or draft critique. By the end of that year, serious pilots would be gated by provenance, tool use records, and human escalation rules. This keeps the systems useful without assuming fully autonomous discovery.
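The first-year gating described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the field names, and the 0.7 threshold are hypothetical and do not come from any of the cited benchmarks. The point it shows is that per-stage scores are kept separate and that a pilot is blocked when any stage or any record-keeping gate fails, instead of averaging everything into one number.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StageScores:
    """Per-stage scores in [0, 1]; the three stages follow the text above."""
    planning: float
    experiment_control: float
    revision_quality: float


@dataclass(frozen=True)
class PilotRecord:
    """Hypothetical evidence attached to one agent run (names are assumptions)."""
    scores: StageScores
    has_provenance: bool        # sources and artifacts are traceable
    has_tool_use_log: bool      # tool calls were recorded
    has_escalation_rule: bool   # a human hand-off rule is defined


def failing_stages(scores: StageScores, threshold: float = 0.7) -> list[str]:
    """Return the names of stages scoring below the illustrative threshold."""
    return [name for name, value in asdict(scores).items() if value < threshold]


def pilot_allowed(record: PilotRecord, threshold: float = 0.7) -> bool:
    """Allow a pilot only if all record gates hold and no stage fails."""
    gates_ok = (
        record.has_provenance
        and record.has_tool_use_log
        and record.has_escalation_rule
    )
    return gates_ok and not failing_stages(record.scores, threshold)


if __name__ == "__main__":
    run = PilotRecord(StageScores(0.9, 0.6, 0.8), True, True, True)
    print(failing_stages(run.scores))
    print(pilot_allowed(run))
```

A strong planning score does not compensate for weak experiment control here, which is the practical difference between stage-specific evaluation and a single composite score.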
In the second year, the field becomes more standardized. Research groups start to align around shared audit fields that describe the task, the allowed tools, and the human intervention record. Applications also become more formal, as organizations ask for evidence of how an AI system contributed to a result. The important change is that the record must show what the agent did, what evidence it used, and where a human corrected it.
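One way to picture such shared audit fields is a small record type. The sketch below is hypothetical: no standard schema is cited in this digest, so every field and method name is an assumption chosen to mirror the three items named above (the task, the allowed tools, and the human intervention record).

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str       # name of the tool the agent invoked
    evidence: str   # what the agent used or produced at this step


@dataclass
class HumanIntervention:
    step: int       # index into tool_calls that a human corrected
    note: str       # what was corrected and why


@dataclass
class AuditRecord:
    """Hypothetical audit record; field names are illustrative assumptions."""
    task: str
    allowed_tools: list[str]
    tool_calls: list[ToolCall] = field(default_factory=list)
    interventions: list[HumanIntervention] = field(default_factory=list)

    def disallowed_calls(self) -> list[str]:
        """List tools that were called but not on the allowed list."""
        allowed = set(self.allowed_tools)
        return [call.tool for call in self.tool_calls if call.tool not in allowed]

    def to_json(self) -> str:
        """Serialize the whole record so it can be stored or exchanged."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    record = AuditRecord(
        task="reproduce a reported baseline",
        allowed_tools=["python_sandbox"],
        tool_calls=[
            ToolCall("python_sandbox", "training log"),
            ToolCall("web_search", "blog post"),
        ],
        interventions=[HumanIntervention(step=1, note="source not permitted")],
    )
    print(record.disallowed_calls())
    print(record.to_json())
```

Keeping the record as plain serializable data is a deliberate choice in this sketch: it lets a reviewer see what the agent did, what evidence it used, and where a human stepped in, without rerunning the agent.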
By the third year, the movement is toward role-specific co-scientist systems rather than one general autonomous scientist. Some systems may be trusted for planning support, digital experiment control, or reproducibility checking. The surrounding infrastructure also matures through controlled sandboxes, provenance stores, and supervisor dashboards. A key monitoring cue is whether leading evaluations reject systems that cannot provide replayable artifacts or exposed intermediate steps. The caveat is that science is not mass production. If audit layers become rigid bureaucracy instead of support for judgment and discovery, this scenario loses much of its value.
The second scenario also grows out of the current outlook, but it stresses a bottleneck. The field may agree that stage-specific evaluation is needed, yet find that the evaluation infrastructure itself is hard to scale. Full-loop testing needs controlled tools, metered compute, expert review time, and provenance systems. The mechanism is a shared-grid problem: many teams can build agents, but reliable comparison needs common evaluation utilities.
In the first year, the pressure appears as benchmark brownouts. Scores become hard to compare because teams use different search access, compute limits, and review rules. Researchers respond by studying reproducibility failures and by designing more efficient evaluation methods. Hosted testing environments begin to appear for planning assistants, code-heavy agents, and review-refinement tools. A monitoring cue is public evidence of non-reproducible benchmark results, merged benchmark infrastructure, or venues asking for tool-environment disclosure.
In the second year, separate evaluation efforts start to consolidate. Shared utilities provide common sandboxes, repeatable tasks, and scoring meters. The reason is simple arithmetic: many agent runs and many expert judgments do not scale well across every new benchmark. Research then focuses on cost-normalized scoring, staged human review, and rubric reuse. Applied groups increasingly treat platform-generated traces as evidence when deciding whether a tool is dependable.
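The arithmetic behind this consolidation can be made concrete with a rough sketch. Both functions below are assumptions for illustration: the digest does not define a cost-normalized score, so dividing a raw score by total evaluation cost is just one simple possibility, and all input numbers are invented.

```python
def expert_review_hours(
    n_benchmarks: int,
    n_agents: int,
    runs_per_agent: int,
    minutes_per_review: float,
) -> float:
    """Total expert hours if every run on every benchmark is reviewed once."""
    total_reviews = n_benchmarks * n_agents * runs_per_agent
    return total_reviews * minutes_per_review / 60


def cost_normalized_score(score: float, compute_cost: float, review_cost: float) -> float:
    """Illustrative normalization: raw score per unit of total evaluation cost."""
    total_cost = compute_cost + review_cost
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return score / total_cost


if __name__ == "__main__":
    # Invented inputs: 5 benchmarks, 10 agents, 20 runs each, 30 minutes per review.
    print(expert_review_hours(5, 10, 20, 30))
    print(cost_normalized_score(0.8, compute_cost=30.0, review_cost=10.0))
```

Because review hours grow with the product of benchmarks, agents, and runs, even modest counts add up quickly, which is the pressure toward staged human review and rubric reuse.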
By the third year, consolidation brings clearer comparisons but also new tension. For computer-science-adjacent tasks, shared utilities reduce duplicated work and make agent behavior easier to inspect. Slower-feedback domains remain harder because they need domain experts and longer evidence cycles. Governance therefore becomes central, with standards groups pushing interoperable records for tool logs, artifact trails, and stage-specific scores. The caveat is that research quality cannot be reduced to one universal meter. The scenario weakens if automated evaluators become reliable enough to replace much expert review, or if open sandbox tooling becomes cheap and mature enough to prevent platform concentration.
The third scenario takes the current evaluation trend and makes it more formal. The mechanism is phase-gated testing, loosely borrowed from medicine. A system would first need to show reliable behavior and usable records, then perform well in controlled tasks, and only later enter supervised research workflows. The trigger would likely be a visible trust shock involving fabricated citations, weak provenance, or compromised review behavior.
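The phase-gated ordering described above can be sketched as follows. The tier names and the `highest_tier` function are hypothetical, introduced only to mirror the three steps in the text. The sketch shows that gates are strictly sequential: passing a later gate does not count unless every earlier gate has also been passed.

```python
from __future__ import annotations

from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical evidence tiers, ordered as in the text above."""
    UNVERIFIED = 0
    RELIABLE_RECORDS = 1      # reliable behavior and usable records
    CONTROLLED_TASKS = 2      # good performance in controlled tasks
    SUPERVISED_WORKFLOW = 3   # cleared for supervised research workflows


# Gates must be cleared in this order.
GATES = [Tier.RELIABLE_RECORDS, Tier.CONTROLLED_TASKS, Tier.SUPERVISED_WORKFLOW]


def highest_tier(passed: set[Tier]) -> Tier:
    """Return the highest tier reached without skipping any earlier gate."""
    reached = Tier.UNVERIFIED
    for gate in GATES:
        if gate not in passed:
            break
        reached = gate
    return reached


if __name__ == "__main__":
    # A system that skipped the controlled-task gate stays at the first tier.
    print(highest_tier({Tier.RELIABLE_RECORDS, Tier.SUPERVISED_WORKFLOW}).name)
```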
In the first year, research moves toward a shared language for evidence tiers. Existing rigor checks, controlled benchmark environments, and review systems would be mapped to different stages of confidence. Practical use would start cautiously. Research offices and publication venues would test checklists, disclosure fields, and audit packs rather than approve autonomous research systems. Scientists would get clearer rules about which tools can draft plans, inspect literature, or run bounded computational experiments.
In the second year, the feedback loop becomes the main force. If sponsors and venues ask for tier reporting, builders have a reason to create better traces, reliability engines, and failure logs. Those records then become training and evaluation data for newer co-scientist systems. Evaluation research also has to test transfer across fields. A tool that works on short programming tasks may not work in biomedical, materials, or social-science settings where feedback is slower.
By the third year, tiered evaluation could become normal infrastructure for publishable AI-assisted research. Some narrow-scope co-scientists might be certified for supervised use, while uncertified assistants remain limited to sandboxed ideation or education. Review-style platforms could also act as surveillance systems by collecting reports on hallucinated citations, prompt-injection events, and weak provenance. A key monitoring cue is whether major research sponsors and venues begin requiring tiered evidence. The caveat is that this path needs coordination across sponsors, venues, and standards groups. It weakens if the field stays with generic AI-use disclosure, if one composite benchmark dominates, or if disputes over tool normalization stall that coordination.
1-Year / 3-Year Research-Application Infographic

References
- Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents - Authors: Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Ang Chen / License: CC0-1.0
- aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists - Authors: Pengsong Zhang, Xiang Hu, Guowei Huang, Yang Qi, Heng Zhang, Xiuxu Li, Jiaxing Song, Jiabin Luo, Yijiang Li, Shuo Yin, Chengxiao Dai, Eric Hanchen Jiang, Xiaoyan Zhou, Zhenfei Yin, Boqin Yuan, Jing Dong, Guinan Su, Guanren Qiao, Haiming Tang, Anghong Du, Lili Pan, Zhenzhong Lan, Xinyu Liu / License: CC-BY-4.0
- Training AI Co-Scientists Using Rubric Rewards - Authors: Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse / License: CC-BY-4.0