Fugu-MT Paper Translation (Abstract): Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Paper overview: Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

arxiv url: http://arxiv.org/abs/2606.12736v1
Date: Wed, 10 Jun 2026 22:55:30 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.49513
Title: Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
Authors: Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao
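Abstract summary: SciAgentArena is a systematic benchmark for evaluating AI agents in real-world scientific research scenarios. It comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Current agents were found to contribute effectively to data analysis when the task structure and evaluation criteria are clear.
Reference score (attention score computed independently by this site): 118.2204632627895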
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. Full codes, tasks, and datasets can be accessed via this link: https://sciagentarena.github.io/.

Related papers