Fugu-MT Paper Translation (Abstract): FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Paper summary: FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

arxiv url: http://arxiv.org/abs/2605.10141v1
Date: Mon, 11 May 2026 07:51:15 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.617887
Title: FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
Title (reference translation): FormalRewardBench: A Benchmark for Reward Models in Formal Theorem Proving
Authors: Zeynel A. Uluşan, Burak S. Akbudak, Can S. Erer, Gözde Gül Şahin
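Abstract summary: We introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our results show that frontier LLMs achieve the highest performance (59.8%), while specialized theorem provers perform the worst (24.4%). We release FormalRewardBench publicly to encourage further research on developing reward models in formal mathematics.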
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert-curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.
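Abstract (reference translation): Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), in which proof assistants provide binary correctness signals. Verifiable rewards are cheap, scalable, and free of reward hacking, but they suffer from sparse credit assignment: on difficult problems, partial progress goes unrewarded and the model receives no learning signal. This motivates learned reward models that can evaluate proof quality beyond binary verification. Comparing reward models is difficult, however, because it usually requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. The benchmark consists of 250 preference pairs, in which a correct proof is paired with an incorrect variant generated by one of five expert-curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). The results show that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on the individual error injection mechanisms and highlight how challenging most of them are. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.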
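
The abstract does not say how the reported percentages are computed. The sketch below assumes the common convention for preference benchmarks: a pair counts as correct when the reward model scores the correct proof strictly higher than the incorrect variant, and ties count as failures. The `PreferencePair` fields, the strategy labels, the toy proofs, and the `length_score` stand-in are all hypothetical and are not taken from the benchmark; the proof strings are illustrative only and have not been checked with Lean.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PreferencePair:
    """One benchmark item (hypothetical schema, not the released format)."""
    theorem: str   # Lean 4 theorem statement
    chosen: str    # proof assumed to be correct
    rejected: str  # incorrect variant produced by error injection
    strategy: str  # label of the error injection strategy (assumed name)


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Fraction of pairs where the chosen proof outscores the rejected one.

    Returns one entry per strategy plus an "overall" entry.
    Ties are counted as failures because the comparison is strict.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        correct = score(pair.theorem, pair.chosen) > score(pair.theorem, pair.rejected)
        for key in (pair.strategy, "overall"):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


def length_score(theorem: str, proof: str) -> float:
    """Toy stand-in for a reward model: shorter proofs get higher scores."""
    return -float(len(proof))


# Two toy pairs; the proof strings are placeholders, not verified Lean code.
pairs = [
    PreferencePair(
        theorem="theorem t1 : 1 + 1 = 2",
        chosen="by norm_num",
        rejected="by\n  -- trivially true\n  sorry",
        strategy="verbose",
    ),
    PreferencePair(
        theorem="theorem t2 (n : Nat) : n + 0 = n",
        chosen="by simp",
        rejected="by rfl'",
        strategy="single_point",
    ),
]

result = pairwise_accuracy(pairs, length_score)
print(result)
```

The length heuristic separates the verbose pair but ties on the single-point pair, where both proofs have the same length, so that pair counts as a failure. A real evaluation would replace `length_score` with a call to the reward model under test.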


Related papers