Coordinate-wise Control Variates for Deep Policy Gradients
- URL: http://arxiv.org/abs/2107.04987v1
- Date: Sun, 11 Jul 2021 07:36:01 GMT
- Title: Coordinate-wise Control Variates for Deep Policy Gradients
- Authors: Yuanyi Zhong, Yuan Zhou, Jian Peng
- Abstract summary: The effect of vector-valued baselines for neural net policies is under-explored.
We show that lower variance can be obtained with such baselines than with the conventional scalar-valued baseline.
- Score: 23.24910014825916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The control variates (CV) method is widely used in policy gradient estimation
to reduce the variance of the gradient estimators in practice. A control
variate is applied by subtracting a baseline function from the state-action
value estimates. Then the variance-reduced policy gradient presumably leads to
higher learning efficiency. Recent research on control variates with deep
neural net policies mainly focuses on scalar-valued baseline functions. The
effect of vector-valued baselines is under-explored. This paper investigates
variance reduction with coordinate-wise and layer-wise control variates
constructed from vector-valued baselines for neural net policies. We present
experimental evidence suggesting that lower variance can be obtained with such
baselines than with the conventional scalar-valued baseline. We demonstrate how
to equip the popular Proximal Policy Optimization (PPO) algorithm with these
new control variates. We show that the resulting algorithm with proper
regularization can achieve higher sample efficiency than scalar control
variates in continuous control benchmarks.
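As a concrete illustration of the baseline idea in the abstract, the sketch below compares a shared scalar baseline with a coordinate-wise (vector-valued) baseline in a score-function policy gradient estimator. This is a minimal toy example under my own assumptions, not the paper's implementation: the policy is a fixed Gaussian over a low-dimensional action, and the per-coordinate baseline is the classical variance-minimizing constant b_i = E[g_i^2 Q] / E[g_i^2] estimated from the batch, whereas the paper learns a state-dependent vector-valued baseline with a neural network, regularizes it, and plugs it into PPO.

```python
# Minimal sketch (toy example, not the paper's implementation): per-coordinate
# variance of a score-function policy gradient under a scalar baseline versus a
# coordinate-wise (vector-valued) baseline.
import numpy as np

rng = np.random.default_rng(0)
d = 4                        # number of policy parameters / gradient coordinates
theta = rng.normal(size=d)   # mean of a toy Gaussian policy over a d-dim action

def rollout(theta):
    """One-step toy 'environment': a ~ N(theta, I), return Q(a) = -||a||^2."""
    a = theta + rng.normal(size=d)
    q = -np.sum(a ** 2)
    score = a - theta        # d log N(a; theta, I) / d theta
    return score, q

samples = [rollout(theta) for _ in range(50_000)]
scores = np.stack([s for s, _ in samples])     # (N, d) score vectors g
returns = np.array([q for _, q in samples])    # (N,)  state-action value estimates Q

# Scalar baseline: a single constant b = E[Q] shared by every coordinate.
b_scalar = returns.mean()
g_scalar = scores * (returns - b_scalar)[:, None]

# Coordinate-wise baseline: per-coordinate constant b_i = E[g_i^2 Q] / E[g_i^2],
# the variance-minimizing constant baseline for coordinate i. Estimating it from
# the same batch adds a small bias; the paper instead learns and regularizes a
# state-dependent baseline network.
b_coord = (scores ** 2 * returns[:, None]).mean(axis=0) / (scores ** 2).mean(axis=0)
g_coord = scores * (returns[:, None] - b_coord[None, :])

print("per-coordinate gradient variance, scalar baseline:    ", g_scalar.var(axis=0))
print("per-coordinate gradient variance, coordinate baseline:", g_coord.var(axis=0))
```

On this toy problem the coordinate-wise constants yield a lower per-coordinate variance than the shared scalar baseline, which is the effect the paper pursues with learned vector-valued baselines.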
Related papers
- Pathwise Gradient Variance Reduction with Control Variates in Variational Inference [2.1638817206926855]
Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution.
In these cases, pathwise and score-function gradient estimators are the most common approaches.
Recent research suggests that even pathwise gradient estimators could benefit from variance reduction; a toy sketch contrasting the two estimators appears after this list.
arXiv Detail & Related papers (2024-10-08T07:28:46Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Improving Deep Policy Gradients with Value Function Search [21.18135854494779]
This paper focuses on improving value approximation and analyzing the effects on Deep PG primitives.
We introduce a Value Function Search that employs a population of perturbed value networks to search for a better approximation.
Our framework does not require additional environment interactions, gradient computations, or ensembles.
arXiv Detail & Related papers (2023-02-20T18:23:47Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm, ZO-RL, which learns the sampling policy used to generate perturbations in ZO optimization instead of relying on random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Regularly Updated Deterministic Policy Gradient Algorithm [11.57539530904012]
This paper proposes a Regularly Updated Deterministic (RUD) policy gradient algorithm for these problems.
This paper theoretically proves that the learning procedure with RUD can make better use of new data in the replay buffer than the traditional procedure.
arXiv Detail & Related papers (2020-07-01T01:18:25Z)
- Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z)
- Scalable Control Variates for Monte Carlo Methods via Stochastic Optimization [62.47170258504037]
This paper presents a framework that encompasses and generalizes existing approaches that use control variates, kernels and neural networks.
Novel theoretical results are presented to provide insight into the variance reduction that can be achieved, and an empirical assessment, including applications to Bayesian inference, is provided in support.
arXiv Detail & Related papers (2020-06-12T22:03:25Z)
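The first related entry above contrasts score-function and pathwise (reparameterization) gradient estimators. The following toy sketch, written under my own assumptions rather than taken from any of the listed papers, estimates d/dmu E_{z ~ N(mu, 1)}[z^2] = 2*mu with both estimators and prints their empirical variances.

```python
# Minimal sketch (toy illustration, not code from any listed paper): compare the
# score-function and pathwise (reparameterization) estimators of
#   d/dmu E_{z ~ N(mu, 1)}[z^2] = 2 * mu.
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 100_000
eps = rng.normal(size=n)
z = mu + eps                 # reparameterized sample, z ~ N(mu, 1)

f = z ** 2                   # integrand f(z) = z^2
score = z - mu               # d log N(z; mu, 1) / d mu

g_score = f * score          # score-function (REINFORCE-style) estimator
g_path = 2.0 * z             # pathwise estimator: (df/dz) * (dz/dmu), with dz/dmu = 1

print("true gradient      :", 2 * mu)
print("score-function     : mean %.3f  var %.2f" % (g_score.mean(), g_score.var()))
print("pathwise (reparam.): mean %.3f  var %.2f" % (g_path.mean(), g_path.var()))
```

Both estimators are unbiased, but the pathwise one has far lower variance on this example; control variates of the kind studied in the entry above aim to reduce the pathwise estimator's variance further.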