Fugu-MT paper translation (abstract): Towards Robust Mathematical Reasoning

Paper overview: Towards Robust Mathematical Reasoning

arxiv url: http://arxiv.org/abs/2511.01846v1
Date: Mon, 03 Nov 2025 18:53:02 GMT
Status: translation complete
Last updated in system: 2025-11-05 16:37:27.379165
Title: Towards Robust Mathematical Reasoning
Title (reference translation): Towards Robust Mathematical Reasoning
Authors: Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung
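Abstract summary: IMO-Bench is a suite of advanced reasoning benchmarks vetted by a panel of top specialists. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities.
Reference score (independently computed attention metric): 41.319782208621156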
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists and that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of the gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations and construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community towards advancing robust mathematical reasoning and release it at https://imobench.github.io/.
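Abstract (reference translation): Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially because existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks that is vetted by a panel of top specialists and specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities; it includes both basic and advanced IMO-level problems, along with detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in the historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). The model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning, and we release it at https://imobench.github.io/.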

Related papers