Fugu-MT Paper Translation (Abstract): How Far Are We From True Auto-Research?

Paper abstract: How Far Are We From True Auto-Research?

arxiv url: http://arxiv.org/abs/2605.19156v1
Date: Mon, 18 May 2026 22:20:33 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.013941
Title: How Far Are We From True Auto-Research?
Title (reference translation): How Far Are We From True Auto-Research?
Authors: Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie,
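Abstract summary: ResearchArena is a minimal scaffold that lets off-the-shelf agents carry out the full research loop themselves. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue.
Reference score (independently computed attention metric): 20.195549933333222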
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human-conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals that this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR, scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex shows 5% paper-vs-artifact mismatch and 8% fabricated references, versus 77% and 72% for Kimi Code, a $\sim$15$\times$ spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still some distance from true auto-research.
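Abstract (reference translation): Recent auto-research systems can produce complete papers, but feasibility is not the same as quality. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses. Claude Code outperforms Analemma's FARS and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Codex shows 5% paper-vs-artifact mismatch and 8% fabricated references, versus 77% and 72% for Kimi Code, a $\sim$15$\times$ spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still some distance from true auto-research.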
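
Note on the figures quoted in the abstract: the short Python sketch below checks the arithmetic behind the paper count and the reported spread. It assumes that each of the 13 seeds counts as one domain, so that 3 agents over 13 seeds give 39 agent-domain pairs; the abstract does not spell this out. The function and constant names are illustrative and do not come from the paper.

```python
# Sanity check of the counts quoted in the abstract.
# Assumption: one seed corresponds to one domain, so the number of
# agent-domain pairs is AGENTS * SEEDS. All names here are hypothetical.

SEEDS = 13            # computer science seeds
AGENTS = 3            # Claude Code, Codex, Kimi Code
TRIALS_PER_PAIR = 3   # trials per agent-domain pair


def total_papers(seeds: int, agents: int, trials: int) -> int:
    """Number of papers if every agent runs every seed `trials` times."""
    return seeds * agents * trials


def spread(high_pct: float, low_pct: float) -> float:
    """Ratio between the higher and the lower failure rate."""
    return high_pct / low_pct


if __name__ == "__main__":
    print(total_papers(SEEDS, AGENTS, TRIALS_PER_PAIR))  # 117
    print(round(spread(77, 5), 1))  # paper-vs-artifact mismatch: 15.4
    print(round(spread(72, 8), 1))  # fabricated references: 9.0
```

Under this reading, the 117 papers follow directly from 13 × 3 × 3, and the roughly 15× spread appears to correspond to the paper-vs-artifact mismatch rates (77% versus 5%); the fabricated-reference rates (72% versus 8%) give a ratio of 9.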

Related papers