Fugu-MT Paper Translation (Abstract): Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Paper abstract: Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

arxiv url: http://arxiv.org/abs/2509.26574v2
Date: Wed, 01 Oct 2025 02:12:55 GMT
Status: Translation complete
Last updated in system: 2025-10-02 12:11:26.814992
Title: Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
Title (reference translation): Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
Authors: Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Yufeng Du, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng
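Abstract summary: This work presents the first benchmark designed to test large language models (LLMs) on research-level reasoning tasks. CritPt consists of 71 composite research challenges. Current state-of-the-art LLMs show early promise on isolated checkpoints, but remain far from being able to reliably solve full research-scale challenges.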
Reference score (independently computed attention measure): 49.42250115889234
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly cover modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed into 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 4.0%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
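Abstract (reference translation): While large language models (LLMs) with reasoning capabilities are advancing rapidly on high-school math competitions and coding, can they reason effectively through the complex, open-ended challenges found in frontier physics research? And, importantly, what kinds of reasoning tasks do physicists want LLMs to help with? To address these questions, we present CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly cover modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed into 190 simpler checkpoint tasks for finer-grained insights. All problems were newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant, machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. Current state-of-the-art LLMs show early promise on isolated checkpoints, but they remain far from reliably solving full research-scale challenges: the best average accuracy among base models is only 4.0%, achieved by GPT-5 (high), rising moderately to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.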

Related papers