Fugu-MT paper translation (abstract): ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding

Paper overview: ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding

arxiv url: http://arxiv.org/abs/2510.10549v1
Date: Sun, 12 Oct 2025 11:11:20 GMT
Status: Translation complete
System update date: 2025-10-14 18:06:30.003363
Title: ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding
Title (reference translation): ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding
Authors: Xinbang Dai, Huikang Hu, Yongrui Chen, Jiaqi Li, Rihui Jin, Yuyang Zhang, Xiaoguang Li, Lifeng Shang, Guilin Qi
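Abstract summary: ELAIPBench is a benchmark curated by domain experts to evaluate how well large language models comprehend AI research papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance.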
Reference score (independently computed attention score): 49.67493845115009
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results, even harming accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.
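Abstract (reference translation): Large language models (LLMs) excel at many domain-specific tasks, but their ability to deeply understand and reason about full-length academic papers remains underexplored. Existing benchmarks often fail to capture such depth because of surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench contains 403 multiple-choice questions drawn from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning over shallow retrieval. Experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Furthermore, frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results and can even lose accuracy because of overthinking or noisy retrieval. These findings highlight the significant gap between current LLM capabilities and genuine comprehension of academic papers.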

Related papers