Fugu-MT Paper Translation (Abstract): VIMPO: Value-Implicit Policy Optimization for LLMs

Paper abstract: VIMPO: Value-Implicit Policy Optimization for LLMs

arxiv url: http://arxiv.org/abs/2606.20008v1
Date: Thu, 18 Jun 2026 09:44:12 GMT
Status: Translation complete
System update date: 2026-06-19 18:23:39.773875
Title: VIMPO: Value-Implicit Policy Optimization for LLMs
Title (reference translation): VIMPO: Value-Implicit Policy Optimization for LLMs
Authors: Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao
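Abstract summary: Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. This paper introduces VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning.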
Reference score (independently computed attention measure): 106.88933849641272
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially larger gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.
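Abstract (reference translation): Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function that brings its own training instability. The authors introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and is anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, so VIMPO can separate reward incorporation (through the value loss) from policy improvement (through a PPO-style actor update). On mathematical RLVR benchmarks, VIMPO improves over GRPO on MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially large gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.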
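Illustration (not from the paper): the abstract contrasts GRPO's trajectory-level advantage with a value recurrence written in policy-reference log-ratios. The Python sketch below is a hedged reading of those two ideas, not VIMPO's actual loss. It assumes the standard KL-regularized identity beta * log(pi(a_t|s_t) / pi_ref(a_t|s_t)) = r_t + V(s_{t+1}) - V(s_t) for deterministic token transitions, with V = 0 after the last token. The function names, the reward layout (zero everywhere except the final token), the use of the population standard deviation, and the coefficient name beta are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Group-relative baseline: one normalized scalar per trajectory,
    copied unchanged to every token of that trajectory."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [[(r - mu) / (sigma + eps)] * n for r, n in zip(rewards, lengths)]


def implied_values(log_ratios, rewards, beta):
    """Backward recurrence V[t] = r[t] + V[t+1] - beta * log_ratio[t],
    anchored at V[T] = 0 (no reward remains after the last token).

    log_ratios[t] is log pi(a_t|s_t) - log pi_ref(a_t|s_t) (assumed given).
    rewards[t] is the per-token reward; for an outcome reward it is zero
    everywhere except the final token (an assumption of this sketch).
    """
    values = [0.0] * (len(log_ratios) + 1)
    for t in reversed(range(len(log_ratios))):
        values[t] = rewards[t] + values[t + 1] - beta * log_ratios[t]
    return values


if __name__ == "__main__":
    # Two sampled answers in one group: one correct (reward 1), one wrong.
    print(grpo_token_advantages([1.0, 0.0], [3, 2]))
    # One three-token answer with an outcome reward of 1 on the last token.
    print(implied_values([0.1, -0.2, 0.3], [0.0, 0.0, 1.0], beta=0.5))
```

The first function gives every token of a trajectory the same number, which is the coarse credit assignment the abstract attributes to group-relative methods. The second yields a different value at each token position using only log-ratios and the outcome reward, which is the kind of per-token signal a critic-free value recurrence could provide; how VIMPO turns such a recurrence into its value loss and actor advantage is specified in the paper, not here.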

Related papers