Fugu-MT Paper Translation (Abstract): LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

Paper overview: LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

arxiv url: http://arxiv.org/abs/2604.01754v1
Date: Thu, 02 Apr 2026 08:22:17 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.609431
Title: LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Title (reference translation): LiveMathematicianBench: A live benchmark for mathematician-level reasoning using proof sketches
Authors: Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani
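Abstract summary: We propose LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. Its pipeline uses high-level proof strategies to construct plausible but invalid answer choices.
Reference score (independently computed attention score): 61.30693283718321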
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
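Abstract (reference translation): Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can perform it meaningfully is a central question in artificial intelligence and cognitive science. As LLMs are integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We introduce LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. Its pipeline uses high-level proof strategies to construct plausible but invalid answer choices that reflect misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. It also introduces a substitution-resistant mechanism that distinguishes answer recognition from substantive reasoning. Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, GPT-5.4 reaches 30.6%, while Gemini-3.1-pro-preview scores 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting that models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.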
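
The reported accuracies can be read against the random baseline with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code: the accuracy figures are taken from the abstract, while the five-choice question format is an assumption inferred from the stated 20% random baseline, and the function names are hypothetical.

```python
# Accuracies (percent) as reported in the abstract.
STANDARD = {"Gemini-3.1-pro-preview": 43.5}
SUBSTITUTION_RESISTANT = {"GPT-5.4": 30.6, "Gemini-3.1-pro-preview": 17.6}

# ASSUMPTION: five answer choices per question, inferred from the
# 20% random baseline quoted in the abstract.
NUM_CHOICES = 5


def random_baseline(num_choices: int) -> float:
    """Expected accuracy (percent) of uniform random guessing."""
    return 100.0 / num_choices


def margin_over_baseline(scores: dict, num_choices: int) -> dict:
    """Percentage points above (positive) or below (negative) random guessing."""
    baseline = random_baseline(num_choices)
    return {model: round(acc - baseline, 1) for model, acc in scores.items()}


def accuracy_drop(standard: dict, resistant: dict) -> dict:
    """Percentage-point drop for models that appear in both settings."""
    return {
        model: round(standard[model] - resistant[model], 1)
        for model in standard
        if model in resistant
    }


if __name__ == "__main__":
    print(random_baseline(NUM_CHOICES))
    print(margin_over_baseline(SUBSTITUTION_RESISTANT, NUM_CHOICES))
    print(accuracy_drop(STANDARD, SUBSTITUTION_RESISTANT))
```

Under these assumptions, a negative margin marks a model that scores below random guessing in the substitution-resistant setting, which is the case the abstract highlights for Gemini-3.1-pro-preview.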


Related papers