Fugu-MT paper translation (abstract): VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation

Paper abstract: VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation

arxiv url: http://arxiv.org/abs/2601.03525v2
Date: Fri, 09 Jan 2026 03:27:47 GMT
Status: Translation complete
Last updated in system: 2026-01-12 13:49:32.382882
Title: VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation
Title (reference translation): VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation
Authors: Longwen Wang, Xuan'er Wu, Xiaohui Hu, Yirui Liu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li
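Abstract summary: VeRPO (Verifiable Dense Reward Policy Optimization) is a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (< 0.02%) and zero GPU memory overhead.
Reference score (independently computed attention score): 43.206705536310245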
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness via executing unit tests, but the resulting sparsity limits potential performance gains. While recent work has explored external Reward Models (RM) to generate richer, continuous rewards, the learned RMs suffer from reward misalignment and prohibitive computational cost. In this paper, we introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. The core idea of VeRPO is constructing dense rewards from weighted partial success: by dynamically estimating the difficulty weight of each unit test based on the execution statistics during training, a dense reward is derived from the sum of weights of the passed unit tests. To solidify the consistency between partial success and end-to-end functional correctness, VeRPO further integrates the dense signal with global execution outcomes, establishing a robust and dense reward paradigm relying solely on verifiable execution feedback. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (< 0.02%) and zero GPU memory overhead.
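Abstract (reference translation): Effective reward design is a central challenge in reinforcement learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness by executing unit tests, but the resulting sparsity limits potential performance gains. Recent work has explored external reward models (RMs) to produce richer, continuous rewards, but learned RMs suffer from reward misalignment and prohibitive computational cost. This paper introduces VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation. By dynamically estimating the difficulty weight of each unit test from execution statistics during training, a dense reward is derived from the sum of the weights of the passed unit tests. To solidify the consistency between partial success and end-to-end functional correctness, VeRPO further integrates the dense signal with global execution outcomes, establishing a robust, dense reward paradigm that relies solely on verifiable execution feedback. Extensive experiments across diverse benchmarks and settings show that VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (< 0.02%) and zero GPU memory overhead.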
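
The weighted-partial-success idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not give the weighting formula or how the dense signal is combined with the global outcome. The sketch assumes that a test's weight is its normalized failure rate across a batch of rollouts, and that the final reward is a convex mix controlled by a hypothetical parameter `alpha`. The function names `difficulty_weights`, `dense_reward`, and `combined_reward` are also illustrative.

```python
from typing import List, Sequence


def difficulty_weights(pass_matrix: Sequence[Sequence[bool]],
                       eps: float = 1e-6) -> List[float]:
    """Estimate one weight per unit test from rollout execution statistics.

    pass_matrix[i][j] is True when rollout i passed unit test j.
    Assumption (not from the paper): a test that fewer rollouts pass is
    treated as harder and receives a larger weight.
    """
    n_rollouts = len(pass_matrix)
    n_tests = len(pass_matrix[0])
    # Failure rate per test, with a small eps so no weight is exactly zero.
    fail_rates = [
        1.0 - sum(row[j] for row in pass_matrix) / n_rollouts + eps
        for j in range(n_tests)
    ]
    total = sum(fail_rates)
    return [f / total for f in fail_rates]  # normalized to sum to 1


def dense_reward(passed: Sequence[bool], weights: Sequence[float]) -> float:
    """Sum of the weights of the passed unit tests."""
    return sum(w for p, w in zip(passed, weights) if p)


def combined_reward(passed: Sequence[bool], weights: Sequence[float],
                    alpha: float = 0.5) -> float:
    """Assumed mix of dense partial credit and the all-tests-pass outcome."""
    outcome = 1.0 if all(passed) else 0.0
    return alpha * dense_reward(passed, weights) + (1.0 - alpha) * outcome


if __name__ == "__main__":
    # Four rollouts evaluated on three unit tests (made-up data).
    batch = [
        [True, True, False],
        [True, False, False],
        [True, True, True],
        [True, False, False],
    ]
    w = difficulty_weights(batch)
    for row in batch:
        print(row, round(combined_reward(row, w), 3))
```

Under these assumptions, a rollout that passes only the test every rollout passes earns almost no reward, while passing rarer tests earns proportionally more, and full credit still requires passing every test.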

Related papers