Fugu-MT Paper Translation (Summary): PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

Paper summary: PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

arxiv url: http://arxiv.org/abs/2604.15411v1
Date: Thu, 16 Apr 2026 16:22:04 GMT
Status: Translation complete
Last updated in system: 2026-04-20 22:00:19.593941
Title: PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
Title (reference translation): PRL-Bench: A Comprehensive Benchmark for Evaluating LLM Capabilities in Frontier Physics Research
Authors: Tingjia Miao, Wenkai Jin, Muhua Zhang, Jinxin Tan, Yuelin Hu, Tu Guo, Jiejun Zhang, Yuhan Wang, Wenbo Li, Yinuo Gao, Shuo Chen, Weiqi Jiang, Yayun Hu, Zixing Lei, Xianghe Pang, Zexi Liu, Yuzhi Zhang, Linfeng Zhang, Kun Chen, Wei Wang, Weinan E, Siheng Chen
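Abstract summary: PRL-Bench is a benchmark for evaluating how well LLMs can carry out end-to-end physics research. It covers astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics. Each task in the benchmark is designed to replicate the core properties of authentic scientific research.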
Reference score (independently computed attention metric): 43.71141859083647
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The paradigm of agentic science requires AI systems to conduct robust reasoning and engage in long-horizon, autonomous exploration. However, current scientific benchmarks remain confined to domain knowledge comprehension and complex reasoning, failing to evaluate the exploratory nature and procedural complexity of real-world research. In this work, we present research-oriented evaluations in theoretical and computational physics, a natural testbed with comprehensive domain knowledge, complex reasoning, and verifiable end-to-end workflows without reliance on experiments. Here we introduce PRL-Bench (Physics Research by LLMs), a benchmark designed to systematically map the capability boundaries of LLMs in executing end-to-end physics research. Constructed from 100 curated papers from the latest issues of Physical Review Letters since August 2025 and validated by domain experts, PRL-Bench covers five major theory- and computation-intensive subfields of modern physics: astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics. Each task in the benchmark is designed to replicate the core properties of authentic scientific research, including exploration-oriented formulation, long-horizon workflows, and objective verifiability, thereby reconstructing the essential reasoning processes and research workflows of real physics research. Evaluation across frontier models shows that performance remains limited, with the best overall score below 50, revealing a pronounced gap between current LLM capabilities and the demands of real scientific research. PRL-Bench serves as a reliable testbed for assessing next-generation AI scientists, advancing AI systems toward autonomous scientific discovery.
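Abstract (reference translation): The paradigm of agentic science requires AI systems to perform robust reasoning and to engage in long-horizon, autonomous exploration. However, current scientific benchmarks are limited to domain knowledge comprehension and complex reasoning, and cannot evaluate the exploratory nature and procedural complexity of real-world research. In this work, we present research-oriented evaluations in theoretical and computational physics, a natural testbed that offers comprehensive domain knowledge, complex reasoning, and verifiable end-to-end workflows that do not depend on experiments. We introduce PRL-Bench (Physics Research by LLMs). PRL-Bench is constructed from 100 curated papers from the latest issues of Physical Review Letters since August 2025, is validated by domain experts, and covers five major theory- and computation-intensive subfields: astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics. Each task in the benchmark is designed to replicate the core properties of authentic scientific research, including exploration-oriented formulation, long-horizon workflows, and objective verifiability, and to reconstruct the essential reasoning processes and research workflows of real physics research. The best score is below 50, which shows a clear gap between current LLM capabilities and the demands of real scientific research. PRL-Bench provides a reliable testbed for assessing next-generation AI scientists, advancing AI systems toward autonomous scientific discovery.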


Related papers