Related papers: PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

URL: http://arxiv.org/abs/2508.21104v3
Date: Fri, 19 Sep 2025 02:37:05 GMT
Title: PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
Authors: Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang
Abstract summary: We propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts during training. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
Score: 6.050409262589219
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to roll out in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts during training. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Moreover, PVPO is orthogonal to other advanced critic-free RL algorithms, making it compatible with and complementary to these methods. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves state-of-the-art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
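
The contrast the abstract draws between intra-group comparison and a reference anchor can be illustrated with a minimal sketch. It is not the paper's implementation. The function names, the mean/standard-deviation normalization, and the choice of "reward minus mean reference reward" as the anchored advantage are all assumptions made for illustration; the abstract does not give the exact formula.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative baseline (assumed form): each rollout is scored
    against the mean and standard deviation of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchored_advantages(rewards, reference_rewards):
    """Anchor-based variant (hypothetical form): each rollout is scored
    against the mean reward of rollouts pre-sampled from a reference model."""
    anchor = mean(reference_rewards)
    return [r - anchor for r in rewards]


# A group in which every policy rollout succeeds (reward 1).
policy_rewards = [1, 1, 1, 1]
# Pre-sampled reference-model rollouts for the same prompt (illustrative values).
reference_rewards = [0, 0, 1, 0]

group_adv = group_relative_advantages(policy_rewards)
anchor_adv = anchored_advantages(policy_rewards, reference_rewards)
```

In this sketch, a group with identical rewards yields zero group-relative advantage for every rollout, while the anchored version still produces a nonzero signal because the baseline comes from outside the group.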

Related papers

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
Maximizing the efficiency of human feedback in AI alignment: a comparative analysis [1.561268797057701]
We explore alternative sampling and evaluation strategies for preference inference in Reinforcement Learning from Human Feedback (RLHF). Our best-performing method, Swiss InfoGain, employs a Swiss tournament system with a proxy mutual-information-gain pairing rule, which significantly outperforms all other methods in constrained annotation budgets. Our experiments demonstrate that adaptive, resource-aware strategies reduce redundancy, enhance robustness, and yield statistically significant improvements in preference learning.
arXiv Detail & Related papers (2025-11-16T21:55:59Z)
Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning [52.97053840476386]
We show that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved.
arXiv Detail & Related papers (2025-11-13T23:06:40Z)
GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO [3.189559302776161]
The ability to train high-performing reward models with few-shot data is critical for enhancing the efficiency and scalability of Reinforcement Learning from Human Feedback. We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to achieve comparable performance to those trained on large-scale datasets.
arXiv Detail & Related papers (2025-06-10T16:37:13Z)
AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Momentum [36.105117202321544]
Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs). Group relative advantage estimation has attracted considerable attention for eliminating the dependency on the value model. We propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy loss using advantages enhanced through a momentum-based estimation scheme.
arXiv Detail & Related papers (2025-05-20T12:13:44Z)
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning [61.403275660120606]
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. We propose leave-one-out PPO (LOOP), a novel RL method for diffusion fine-tuning. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
arXiv Detail & Related papers (2025-03-02T13:43:53Z)
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples. However, IS is employed in RL as a passive tool for re-weighting historical samples. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate. We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high.
arXiv Detail & Related papers (2023-02-05T04:44:35Z)
Bootstrap Advantage Estimation for Policy Optimization in Reinforcement Learning [16.999444076456268]
This paper proposes an advantage estimation approach based on data augmentation for policy optimization. Our method uses data augmentation to compute a bootstrap advantage estimation. We observe that our method reduces the policy and the value loss better than the Generalized advantage estimation.
arXiv Detail & Related papers (2022-10-13T19:30:43Z)
DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior without access to the control signals generated by the demonstrator. Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms. This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk. We propose a more data-efficient IfO algorithm
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.