VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation
- URL: http://arxiv.org/abs/2601.03525v2
- Date: Fri, 09 Jan 2026 03:27:47 GMT
- Title: VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation
- Authors: Longwen Wang, Xuan'er Wu, Xiaohui Hu, Yirui Liu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li
- Abstract summary: We introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (< 0.02%) and zero GPU memory overhead.
- Score: 43.206705536310245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness via executing unit tests, but the resulting sparsity limits potential performance gains. While recent work has explored external Reward Models (RM) to generate richer, continuous rewards, the learned RMs suffer from reward misalignment and prohibitive computational cost. In this paper, we introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. The core idea of VeRPO is constructing dense rewards from weighted partial success: by dynamically estimating the difficulty weight of each unit test based on the execution statistics during training, a dense reward is derived from the sum of weights of the passed unit tests. To solidify the consistency between partial success and end-to-end functional correctness, VeRPO further integrates the dense signal with global execution outcomes, establishing a robust and dense reward paradigm relying solely on verifiable execution feedback. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (< 0.02%) and zero GPU memory overhead.
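The weighted-partial-success idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the function names `difficulty_weights` and `dense_reward` are hypothetical, the difficulty weight is assumed to be one minus a test's empirical pass rate (normalized to sum to 1), and the dense signal is assumed to be blended with the global pass/fail outcome through a hypothetical mixing coefficient `alpha`. It is not the paper's implementation.

```python
from typing import List


def difficulty_weights(pass_counts: List[int], attempts: int) -> List[float]:
    """Assumed heuristic: tests that fewer rollouts pass get larger weights.

    pass_counts[i] is how many of `attempts` sampled programs passed test i.
    Weights are normalized to sum to 1.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    raw = [1.0 - count / attempts for count in pass_counts]
    total = sum(raw)
    if total == 0.0:
        # Every test was passed by every rollout: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def dense_reward(passed: List[bool], weights: List[float], alpha: float = 0.5) -> float:
    """Blend weighted partial success with the global pass/fail outcome.

    `alpha` is a hypothetical mixing coefficient, not a value from the paper.
    """
    partial = sum(w for ok, w in zip(passed, weights) if ok)
    outcome = 1.0 if all(passed) else 0.0
    return alpha * partial + (1.0 - alpha) * outcome


# Four unit tests, eight sampled programs; the last test is never passed.
weights = difficulty_weights([8, 6, 2, 0], attempts=8)
print(weights)
print(dense_reward([True, True, False, False], weights))
```

Under these assumptions the weights come out as `[0.0, 0.125, 0.375, 0.5]`, so a program that passes only the two easiest tests earns 0.0625, while a program that passes every test earns the full reward of 1.0. Both quantities come only from executing unit tests, which is the sense in which the reward stays verifiable.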
Related papers
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
- P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering [51.04492568024515]
We introduce Probabilistic Process Supervision (P2S), a novel framework for fine-grained process rewards. P2S provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps.
arXiv Detail & Related papers (2026-01-28T14:35:20Z)
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
- Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering [28.35101062722637]
Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). We propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. We show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines.
arXiv Detail & Related papers (2026-01-13T10:55:08Z)
- Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks [12.31210445905605]
We introduce Principle Process Reward (PPR), an RL approach that unifies step-level assessment and outcome verification. PPR achieves state-of-the-art performance across a wide range of benchmarks, demonstrating its impressive robustness and generalization.
arXiv Detail & Related papers (2025-09-29T23:44:55Z)
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR [25.56828724912418]
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates. We propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling.
arXiv Detail & Related papers (2025-09-02T17:22:46Z)
- Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts [25.205293698698867]
We introduce Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes.
arXiv Detail & Related papers (2025-08-13T18:37:46Z)
- Intra-Trajectory Consistency for Reward Modeling [67.84522106537274]
We develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results.
arXiv Detail & Related papers (2025-06-10T12:59:14Z)
- Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards [11.149294285483782]
We propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning.
arXiv Detail & Related papers (2025-05-30T14:34:57Z)
- RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning [44.770495418026734]
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals.
Traditional methods assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards.
We propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism.
arXiv Detail & Related papers (2024-10-26T13:12:27Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.