Related papers: Improving Value Estimation Critically Enhances Vanilla Policy Gradient

Improving Value Estimation Critically Enhances Vanilla Policy Gradient

URL: http://arxiv.org/abs/2505.19247v1
Date: Sun, 25 May 2025 17:54:32 GMT
Title: Improving Value Estimation Critically Enhances Vanilla Policy Gradient
Authors: Tao Wang, Ruipeng Zhang, Sicun Gao,
Abstract summary: Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks.<n>We show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration.
Score: 18.67760937266409
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.

Related papers

Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models.<n> token-level correction often leads to unstable training dynamics when the degree of off-policyness is large.<n>We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO)
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning [48.34492357368989]
We propose a primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse.<n>$R2VPO$ achieves superior performance with average relative gains of up to 17% over strong clipping-based baselines.
arXiv Detail & Related papers (2026-01-06T14:01:42Z)
Relative Entropy Pathwise Policy Optimization [56.86405621176669]
We show how to construct a value-gradient driven, on-policy algorithm that allow training Q-value models purely from on-policy data.<n>We propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z)
EXPO: Stable Reinforcement Learning with Expressive Policies [74.30151915786233]
We propose a sample-efficient online reinforcement learning algorithm to maximize value with two parameterized policies.<n>Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods.
arXiv Detail & Related papers (2025-07-10T17:57:46Z)
Value Improved Actor Critic Algorithms [5.301318117172143]
We extend the standard framework of actor critic algorithms with value-improvement.<n>We prove that this approach converges in the popular analysis scheme of Generalized Policy Iteration.<n> Empirically, incorporating value-improvement into the popular off-policy actor-critic algorithms TD3 and SAC significantly improves or matches performance over their respective baselines.
arXiv Detail & Related papers (2024-06-03T15:24:15Z)
Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples. However, IS is employed in RL as a passive tool for re-weighting historical samples. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs. In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective. We show that the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to both the PPO objective and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
Improving Deep Policy Gradients with Value Function Search [21.18135854494779]
This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives. We introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation. Our framework does not require additional environment interactions, gradient computations, or ensembles.
arXiv Detail & Related papers (2023-02-20T18:23:47Z)
The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the emphstate value baseline allows on-policy. emphnatural policy gradient (NPG) to converge to a globally optimal. policy at an $O (1/t) rate gradient. We find that the primary effect of the value baseline is to textbfreduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms [10.356356383401566]
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. We discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. We introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $mathcalO(t-1)$ under classic assumptions.
arXiv Detail & Related papers (2022-02-15T15:04:10Z)
Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important to apply RL algorithms to many high stakes tasks. Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes. We show that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)
Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation. We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z)
BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy. We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new for RL method, BRPO, which learns both the policy and allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.