Fugu-MT Paper Translation (Abstract): Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

Paper overview: Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

arxiv url: http://arxiv.org/abs/2602.02377v1
Date: Mon, 02 Feb 2026 17:42:53 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:34.328768
Title: Proof-RM: A Scalable and Generalizable Reward Model for Math Proof
Authors: Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu, Xu Niu, Yike Sun, Yi Hu, Zhouchen Lin, Muhan Zhang
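Abstract summary: Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR). Many advanced mathematical problems are proof-based, however, and there is no guaranteed way to determine the authenticity of a proof by simple answer matching. Automatic verification therefore requires a Reward Model (RM) that can reliably evaluate full proof processes.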
Reference score (attention score computed independently by this site): 67.53066972145183
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "**question-proof-check**" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.

