Fugu-MT Paper Translation (Abstract): Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Paper overview: Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

arxiv url: http://arxiv.org/abs/2605.00674v1
Date: Fri, 01 May 2026 13:56:34 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.977167
Title: Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
Title (reference translation): Beyond Benchmarks - A Mathematics Evaluation Platform Using LLMs with MathArena
Authors: Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, Martin Vechev
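Abstract summary: We build on the original MathArena benchmark by substantially broadening its scope. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. The strongest model, GPT-5.5, reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions.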
Reference score (independently computed attention score): 4.559742899048613
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.
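Abstract (reference translation): Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark, substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve, to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions. This highlights the importance of continuously maintained evaluation platforms like MathArena for tracking the rapid progress of LLMs in mathematical reasoning.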


List of related papers