Fugu-MT Paper Translation (Abstract): Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

Paper overview: Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

arxiv url: http://arxiv.org/abs/2511.13027v1
Date: Mon, 17 Nov 2025 06:25:35 GMT
Status: Translation complete
Last updated in system: 2025-11-18 14:36:24.720474
Title: Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection
Title (reference translation): Scaling Generative Verification for Natural Language Mathematical Proofs and Selection
Authors: Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, Wei Du, Ivan Moshkov, George Armstrong, Renjie Liao, Christos Thrampoulidis, Igor Gitman
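Abstract summary: Large language models have achieved remarkable success on final-answer problems. However, the reasoning underlying these solutions is often flawed. To obtain a more reliable measure of model performance, we evaluate both proof-based and final-answer reasoning.
Reference score (independently calculated attention score): 42.21636315733425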
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.
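Abstract (reference translation): Large language models have achieved remarkable success on final-answer mathematical problems, owing to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Adapting to rigorous proof-based mathematics requires reliable proof verification capabilities. We first analyze multiple evaluation setups and show that focusing on a single benchmark can lead to unstable or misleading results. We therefore evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. Next, we scale two major generative verification methods, GenSelect and LLM-as-a-Judge, to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance. However, despite improving proof-level metrics, reinforcement learning does not improve final-answer precision, so current models often reward stylistic or procedural correctness rather than mathematical validity. This work establishes practical guidelines for designing and evaluating scalable proof-verification and selection systems.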

Related papers