Fugu-MT Paper Translation (Abstract): JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Paper abstract: JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

arxiv url: http://arxiv.org/abs/2604.25419v1
Date: Tue, 28 Apr 2026 09:29:00 GMT
Status: Translation complete
System updated: 2026-04-29 16:49:17.795647
Title: JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
Authors: Xinjie Chen, Biao Fu, Jing Wu, Guoxin Chen, Xinggao Liu, Dayiheng Liu, Minpeng Liao,
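Abstract summary: JURY-RL is a label-free RLVR framework that decouples answer proposal from reward disposal. It consistently outperforms other label-free baselines on mathematical reasoning benchmarks. Its pass@1 performance is comparable to supervised ground-truth training.
Reference score (attention score computed independently by the site): 39.03968285406107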
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
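The reward rule described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `jury_rewards` and the `verify` callback (a stand-in for a Lean check that returns True only on successful verification) are hypothetical. The abstract does not give the ResZero formula, so the fallback below simply assigns zero to the unverified plurality rollouts and centers the vote shares of the residual answers to obtain a zero-mean signal. It does not attempt to reproduce the variance-preserving scaling.

```python
from collections import Counter
from typing import Callable, List


def jury_rewards(answers: List[str], verify: Callable[[str], bool]) -> List[float]:
    """Per-rollout rewards from a plurality vote plus a verifier verdict.

    `verify` is a hypothetical stand-in for a Lean check: it should return
    True only when the proposed answer is successfully verified.
    """
    if not answers:
        return []

    # Votes propose: the plurality answer is the only candidate for positive reward.
    proposal, _ = Counter(answers).most_common(1)[0]

    # Proofs dispose: positive reward only if the proposal is verified.
    if verify(proposal):
        return [1.0 if a == proposal else 0.0 for a in answers]

    # Fallback (ASSUMED form, not the paper's formula): drop the unverified
    # proposal and give the residual rollouts a zero-mean signal based on
    # their vote share within the residual group.
    residual = [a for a in answers if a != proposal]
    if not residual:
        return [0.0] * len(answers)
    residual_counts = Counter(residual)
    share = {a: c / len(residual) for a, c in residual_counts.items()}
    mean_share = sum(share[a] for a in residual) / len(residual)
    return [0.0 if a == proposal else share[a] - mean_share for a in answers]


if __name__ == "__main__":
    rollouts = ["4", "4", "4", "5", "5", "6"]
    print(jury_rewards(rollouts, lambda a: True))   # verified proposal
    print(jury_rewards(rollouts, lambda a: False))  # inconclusive verification
```

In the inconclusive case the plurality answer receives no positive reward, so an unverifiable consensus is not reinforced, while the residual rollouts still carry a nonzero signal that sums to zero.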
