Fugu-MT paper translation (abstract): Riemann-Bench: A Benchmark for Moonshot Mathematics

Paper overview: Riemann-Bench: A Benchmark for Moonshot Mathematics

arxiv url: http://arxiv.org/abs/2604.06802v1
Date: Wed, 08 Apr 2026 08:16:37 GMT
Status: translation complete
Last updated in system: 2026-04-09 17:30:51.420799
Title: Riemann-Bench: A Benchmark for Moonshot Mathematics
Authors: Suhaas Garre, Erik Knutsen, Sushant Mehta, Edwin Chen
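Abstract summary: Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics.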
Reference score (site-computed attention score): 0.12430801435092285
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.
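Note on the scoring: the abstract mentions an unbiased statistical estimator computed over 100 independent runs per problem but does not name it. One common choice in model evaluation is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of runs and c the number of correct runs. The sketch below assumes that estimator purely for illustration; the function names `pass_at_k` and `benchmark_score` are hypothetical, and the paper's actual estimator may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k runs,
    drawn without replacement from n recorded runs, is correct.

    n: total runs recorded for one problem (the abstract reports 100)
    c: number of those runs judged correct
    k: attempt budget being estimated
    ASSUMPTION: pass@k is used here for illustration; the abstract
    does not say which estimator the authors use.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb(n - c, k) is 0 when fewer than k runs are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int = 100, k: int = 1) -> float:
    """Average the per-problem estimates into one benchmark-level score."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With k = 1 the estimate reduces to the plain success rate c / n for each problem, and `benchmark_score` is the mean of those rates over the problem set. Any counts passed in would have to come from real evaluation logs; none are given in the abstract.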

Related papers