FuguReport

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

Authors: Guiyao Tie, Jiawen Shi, Dingjie Song, Yixiao Huang, Ziji Sheng, Xueyang Zhou, Daizong Liu, Pan Zhou, Yongchao Chen, Ran Xu, Lifang He, Qingsong Wen, Manling Li, Cong Lu, Shuai Li, Pengtao Xie, Yixuan Yuan, Rui Meng, Lei Xing, Lichao Sun, Caiming Xiong, Philip S. Yu, Jianfeng Gao
Affiliations: Recursive Superintelligence / Shanghai Jiao Tong University / Wuhan University / Lehigh University / Huazhong University of Science and Technology / Microsoft / Stanford University / Squirrel Ai Learning / Independent / The Chinese University of Hong Kong / University of Illinois Chicago / Northwestern University / Salesforce / Google / University of California, San Diego / Tsinghua University
Categories: Method / Research Automation / AI systems for scientific workflow automation; Evaluation / Model Evaluation / Assessment using novelty and impact dimensions; Application / Scientific Discovery / AI autonomy under domain-specific conditions
License: CC BY 4.0

Abstract Overview

This paper surveys the emerging shift from task-level AI for Science toward workflow-level research automation, which the authors term AutoResearch. Using a workflow-centered lens, it compares systems by how they redistribute control, execution, validation, evidence handling, and accountability across stages such as literature grounding, hypothesis formation, experimentation, review, and reporting. The paper introduces a five-level autonomy spectrum from L0 to L4, distinguishing human-steered "Vibe Research" at L1-L2 from more stringent AI-led autonomy targets at L3-L4. It argues that current systems show meaningful progress in search, drafting, coding, and bounded execution, but remain limited in validation, reproducibility, provenance, rejection of weak directions, and accountable scientific closure. The survey further contends that attainable autonomy is strongly domain-dependent, with higher levels appearing more credible in structured, executable, and rapidly verifiable settings than in embodied or high-stakes scientific domains.
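The L0-L4 spectrum can be pictured as an ordered scale split into the two regions the abstract names. The Python sketch below shows one way to model it. It makes two assumptions: that higher levels mean more AI-led control, and what to call L0, which the summary does not name. The `AutonomyLevel` enum and the `region` helper are hypothetical illustrations, not the survey's own definitions.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """L0-L4 autonomy spectrum; assumed ordering: higher = more AI-led control."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4


def region(level: AutonomyLevel) -> str:
    """Return the coarse region of the spectrum that a level falls in."""
    if level >= AutonomyLevel.L3:
        # L3-L4: the stricter, AI-led AutoResearch targets.
        return "AutoResearch target (AI-led)"
    if level >= AutonomyLevel.L1:
        # L1-L2: the human-steered "Vibe Research" region.
        return "Vibe Research (human-steered)"
    # The summary does not name the L0 region; this label is an assumption.
    return "below Vibe Research (unnamed in summary)"


for lvl in AutonomyLevel:
    print(lvl.name, "->", region(lvl))
```

Under this sketch, the "advanced human-verified L2 systems" discussed in the survey still fall in the Vibe Research region, not the AutoResearch target.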

Novelty

The paper’s main novelty is a unified workflow-level framing of AI-powered research automation rather than a taxonomy organized only by model type, agent architecture, or benchmark scores. It introduces a conservative L0-L4 autonomy spectrum, distinguishes the L1-L2 "Vibe Research" region from stricter AutoResearch targets, and proposes five evaluation dimensions—novelty, validity, impact, reliability, and provenance—for judging scientific credibility.
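The summary names the five evaluation dimensions but does not describe a scoring procedure. The Python sketch below treats them as a pass/fail checklist that reports which dimensions still lack support. It is an assumption-labeled illustration only: the `CredibilityAssessment` class, the boolean encoding, and the rule that every dimension must pass are hypothetical choices, not the survey's method.

```python
from dataclasses import dataclass, fields


@dataclass
class CredibilityAssessment:
    """Hypothetical checklist over the survey's five evaluation dimensions.

    Each flag records whether a reviewer judged that dimension adequately
    supported; the boolean encoding is an assumption of this sketch.
    """
    novelty: bool
    validity: bool
    impact: bool
    reliability: bool
    provenance: bool

    def unmet(self) -> list[str]:
        # Names of dimensions not yet supported, in declaration order.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def credible(self) -> bool:
        # Assumed rule: every dimension must hold, so a strong novelty
        # claim cannot compensate for missing provenance.
        return not self.unmet()


claim = CredibilityAssessment(
    novelty=True, validity=True, impact=True, reliability=False, provenance=False
)
print(claim.unmet())     # ['reliability', 'provenance']
print(claim.credible())  # False
```

Requiring all five dimensions to pass is only one design choice. A graded score per dimension would be an equally plausible reading of the survey's framework.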

Results

As a survey, the paper’s primary outcomes are conceptual and organizational rather than experimental. It synthesizes prior systems, benchmarks, infrastructures, and domain deployments into a common framework, and argues that most current end-to-end research pipelines are better understood as advanced human-verified L2 systems rather than mature L3 autonomy. It also identifies domain-conditioned autonomy ceilings, emphasizing that stronger automation is presently more plausible where artifacts are machine-readable, executable, and easier to audit.

Key Points

  1. The paper defines AutoResearch as workflow-level AI participation in scientific inquiry and formalizes it with an L0-L4 autonomy spectrum.
  2. It organizes the technical foundations of research automation around five recurring workflow conditions spanning grounding, planning, experimentation, validation, and reporting.
  3. It argues that evaluation should emphasize scientific credibility—especially novelty, validity, impact, reliability, and provenance—and that feasible autonomy depends strongly on the scientific domain.
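The workflow-level framing compares systems by how they distribute execution and validation across stages. The Python sketch below makes that concrete by recording, for each of the five stages named in key point 2, who executes the stage and who validates its output. The `Actor` and `StageControl` types, the stage-name strings, and the `human_checkpoints` helper are hypothetical illustrations, not part of the survey.

```python
from dataclasses import dataclass
from enum import Enum

# Stage labels taken from the five workflow conditions in key point 2.
STAGES = ("grounding", "planning", "experimentation", "validation", "reporting")


class Actor(Enum):
    HUMAN = "human"
    AI = "ai"


@dataclass(frozen=True)
class StageControl:
    """Who executes a stage and who signs off on its output (hypothetical)."""
    stage: str
    executor: Actor
    validator: Actor


def check_pipeline(pipeline: list[StageControl]) -> None:
    # Require exactly the five stages, in workflow order.
    names = tuple(s.stage for s in pipeline)
    if names != STAGES:
        raise ValueError(f"expected stages {STAGES}, got {names}")


def human_checkpoints(pipeline: list[StageControl]) -> list[str]:
    # Stages whose output a human must still verify before the workflow proceeds.
    check_pipeline(pipeline)
    return [s.stage for s in pipeline if s.validator is Actor.HUMAN]


# An AI-executed pipeline whose validation and reporting remain human-verified.
pipeline = [
    StageControl("grounding", Actor.AI, Actor.AI),
    StageControl("planning", Actor.AI, Actor.AI),
    StageControl("experimentation", Actor.AI, Actor.AI),
    StageControl("validation", Actor.AI, Actor.HUMAN),
    StageControl("reporting", Actor.AI, Actor.HUMAN),
]
print(human_checkpoints(pipeline))  # ['validation', 'reporting']
```

A pipeline that still yields human checkpoints here resembles the human-verified L2 systems the survey describes. An empty list alone would not establish L3 autonomy, because the survey also stresses reproducibility, provenance, and accountable scientific closure.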


This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.