Fugu-MT Paper Translation (Abstract): ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

Paper overview: ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

arxiv url: http://arxiv.org/abs/2510.11652v1
Date: Mon, 13 Oct 2025 17:30:36 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.48012
Title: ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems
Title (reference translation): ACADREASON: Exploring the limits of reasoning models in light of academic research problems
Authors: Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou
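Abstract summary: The Acadreason benchmark is designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains: computer science, economics, law, mathematics, and philosophy. The results show that most LLMs scored below 20 points, and even the cutting-edge GPT-5 achieved only 16 points.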
Reference score (independently computed attention score): 47.451132653010774
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains, including computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of Acadreason.
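Abstract (reference translation): In recent years, the research focus of large language models (LLMs) and agents has been shifting from demonstrating new capabilities toward complex reasoning and tackling difficult tasks. However, existing evaluations concentrate on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, so no rigorous benchmark for high-level reasoning exists. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems spanning five high-reasoning domains: computer science, economics, law, mathematics, and philosophy. All questions are drawn from top-tier publications of recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We systematically evaluate more than 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, and even the cutting-edge GPT-5 achieved only 16 points. Agents achieved higher scores, but none exceeded 40 points. This demonstrates the current capability gap of LLMs and agents on super-intelligent academic research tasks and highlights the challenges posed by Acadreason.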


Related papers