Fugu-MT Paper Translation (Abstract): Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

Paper abstract: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

arxiv url: http://arxiv.org/abs/2509.24981v1
Date: Mon, 29 Sep 2025 16:09:07 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:20.119758
Title: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
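Title (reference translation): Random Policy Valuation for LLM Reasoning with Verifiable Rewards
Authors: Haoran He, Yuxiao Ye, Qingpeng Cai, Chen Hu, Binxing Jiao, Daxin Jiang, Ling Pan
Abstract summary: We introduce Random Policy Valuation for Diverse Reasoning (ROVER). ROVER is a minimalist yet highly effective RL method that samples actions from a softmax over uniform-policy Q-values. It demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%).
Reference score (independently computed attention score): 47.557539197058496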
License: http://creativecommons.org/licenses/by/4.0/
Abstract: RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy's value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%), despite its radical simplification compared to strong, complicated existing methods.
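Abstract (reference translation): RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely mainly on policy optimization frameworks such as PPO and GRPO, which follow generalized policy iteration, alternating between evaluating the current policy's value and improving the policy based on that evaluation. Although effective, they often suffer from training instability and diversity collapse, and require complex heuristic tricks and careful tuning. Standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, its underlying structure is simpler than the general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, which suggests that several sophisticated techniques in existing methods can be reduced or even omitted. Based on this insight, the optimal action can be recovered from the Q-function of a fixed uniformly random policy, bypassing the generalized policy iteration loop and its associated heuristics. To translate this principle into a practical and scalable algorithm for LLM math reasoning, we introduce Random Policy Valuation for Diverse Reasoning (ROVER). ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER shows superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%), despite its radical simplification compared with strong, complicated existing methods.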
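
The core idea in the abstract (evaluate a fixed uniformly random policy, then act on its Q-values) can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the tree `TREE`, the helper names, and the `temperature` parameter are all hypothetical, and the Q-values are computed exactly by recursion over a tiny tree rather than learned by an LLM. Leaves carry binary terminal rewards, transitions are deterministic, and the state space is a tree, matching the setting the abstract describes.

```python
import math
import random

# Hypothetical toy task: nested lists are states, integer leaves are
# binary terminal rewards (1 = answer verified correct, 0 = incorrect).
TREE = [
    [0, 0, [0, 1]],   # action 0: one correct leaf, buried deep
    [1, [1, 1]],      # action 1: every leaf is correct
    [0, 0],           # action 2: no correct leaf
]

def uniform_value(node):
    """Expected terminal reward when every action is chosen uniformly at random."""
    if isinstance(node, int):
        return float(node)
    return sum(uniform_value(child) for child in node) / len(node)

def uniform_q(node):
    """Q-values of the uniform random policy for each action at `node`."""
    return [uniform_value(child) for child in node]

def softmax(values, temperature):
    """Numerically stable softmax over values / temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def rollout(tree, temperature=None, rng=random):
    """Walk from the root to a leaf.

    temperature=None acts greedily on the uniform-policy Q-values;
    otherwise actions are sampled from a softmax over those Q-values.
    """
    node, path = tree, []
    while not isinstance(node, int):
        qs = uniform_q(node)
        if temperature is None:
            action = max(range(len(qs)), key=qs.__getitem__)
        else:
            probs = softmax(qs, temperature)
            action = rng.choices(range(len(qs)), weights=probs)[0]
        path.append(action)
        node = node[action]
    return path, node

if __name__ == "__main__":
    print("uniform-policy Q at root:", uniform_q(TREE))
    print("greedy rollout:", rollout(TREE))
    rng = random.Random(0)
    samples = [rollout(TREE, temperature=0.5, rng=rng) for _ in range(1000)]
    correct_paths = {tuple(path) for path, reward in samples if reward == 1}
    print("distinct correct paths sampled:", len(correct_paths))
```

In this toy tree, an action has a positive uniform-policy Q-value exactly when some correct leaf lies beneath it, so acting greedily on those Q-values always ends at a correct leaf whenever one exists. Sampling from the softmax instead spreads probability across several branches that contain correct leaves, which is the intuition behind the diversity claim; how the paper estimates these Q-values at LLM scale is not covered by this sketch.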

Related papers