An Analysis of Measure-Valued Derivatives for Policy Gradients
- URL: http://arxiv.org/abs/2203.03917v1
- Date: Tue, 8 Mar 2022 08:26:31 GMT
- Title: An Analysis of Measure-Valued Derivatives for Policy Gradients
- Authors: Joao Carvalho and Jan Peters
- Abstract summary: We study a different type of gradient estimator - the Measure-Valued Derivative.
This estimator is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators.
We show that it can reach performance comparable to methods based on the likelihood-ratio or reparametrization tricks.
- Score: 37.241788708646574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning methods for robotics are increasingly successful due
to the constant development of better policy gradient techniques. A precise
(low variance) and accurate (low bias) gradient estimator is crucial for tackling
increasingly complex tasks. Traditional policy gradient algorithms use the
likelihood-ratio trick, which is known to produce unbiased but high variance
estimates. More modern approaches exploit the reparametrization trick, which
gives lower variance gradient estimates but requires differentiable value
function approximators. In this work, we study a different type of stochastic
gradient estimator - the Measure-Valued Derivative. This estimator is unbiased,
has low variance, and can be used with differentiable and non-differentiable
function approximators. We empirically evaluate this estimator in the
actor-critic policy gradient setting and show that it can reach performance
comparable to methods based on the likelihood-ratio or reparametrization
tricks, both in low- and high-dimensional action spaces. With this work, we want
to show that the Measure-Valued Derivative estimator can be a useful
alternative to other policy gradient estimators.
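To make the comparison in the abstract concrete, the sketch below estimates the gradient of E_{x ~ N(mu, sigma^2)}[f(x)] with respect to the mean mu of a Gaussian for a toy quadratic f, using the three estimators discussed above: the likelihood-ratio (score function) trick, the reparametrization trick, and the measure-valued derivative. The weak-derivative triple for the Gaussian mean (constant 1/(sigma*sqrt(2*pi)), positive part mu + sigma*W, negative part mu - sigma*W with W Weibull-distributed) is the standard one from the Monte Carlo gradient estimation literature; the toy function, sample size, and seed are illustrative choices, not code from the paper.

```python
# Minimal sketch (illustrative, not from the paper): three Monte Carlo estimators of
# d/d_mu E_{x ~ N(mu, sigma^2)}[f(x)] for a toy quadratic f(x) = (x - 2)^2,
# whose exact gradient is 2 * (mu - 2).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 1.0, 100_000

def f(x):
    return (x - 2.0) ** 2

# 1) Likelihood-ratio (score function) estimator:
#    E[f(x) * d/d_mu log N(x; mu, sigma^2)], with score (x - mu) / sigma^2.
x = rng.normal(mu, sigma, n)
grad_lr = np.mean(f(x) * (x - mu) / sigma**2)

# 2) Reparametrization estimator: x = mu + sigma * eps with eps ~ N(0, 1),
#    so d/d_mu f(x) = f'(x) = 2 * (x - 2); requires a differentiable f.
eps = rng.standard_normal(n)
grad_rep = np.mean(2.0 * (mu + sigma * eps - 2.0))

# 3) Measure-valued derivative for the Gaussian mean: weak-derivative triple
#    with constant c = 1 / (sigma * sqrt(2 * pi)), positive part mu + sigma * W,
#    negative part mu - sigma * W, where W ~ Weibull(scale=sqrt(2), shape=2).
w = np.sqrt(2.0) * rng.weibull(2.0, n)
c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
grad_mvd = c * np.mean(f(mu + sigma * w) - f(mu - sigma * w))

print(f"exact {2 * (mu - 2.0):+.3f}  LR {grad_lr:+.3f}  "
      f"reparam {grad_rep:+.3f}  MVD {grad_mvd:+.3f}")
```

For this differentiable f all three estimators agree in expectation; the point of the measure-valued derivative, as the abstract notes, is that unlike the reparametrization estimator it does not require f (e.g. a critic) to be differentiable.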
Related papers
- Compatible Gradient Approximations for Actor-Critic Algorithms [0.0]
We introduce an actor-critic algorithm that bypasses the need for a precise action-value gradient by employing a zeroth-order approximation of it.
Empirical results demonstrate that our algorithm not only matches but frequently exceeds the performance of current state-of-the-art methods.
arXiv Detail & Related papers (2024-09-02T22:00:50Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance (a generic importance-weighted gradient sketch is given after this list).
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Gradient Estimation with Discrete Stein Operators [44.64146470394269]
We introduce a variance reduction technique based on Stein operators for discrete distributions.
Our technique achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.
arXiv Detail & Related papers (2022-02-19T02:22:23Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- An Empirical Analysis of Measure-Valued Derivatives for Policy Gradients [24.976352541745403]
We study a different type of gradient estimator: the Measure-Valued Derivative.
This estimator is unbiased, has low variance, and can be used with differentiable and non-differentiable function approximators.
We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it can reach performance comparable to methods based on the likelihood-ratio or reparametrization tricks.
arXiv Detail & Related papers (2021-07-20T09:26:10Z)
- Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z)
- Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z)
- Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
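The importance-sampling entry above ("Policy Gradient with Active Importance Sampling") refers to reusing previously collected samples in policy gradient estimation. The sketch below shows only the generic, passive importance-weighted likelihood-ratio estimator that such methods build on, for a one-dimensional Gaussian policy; the behavioral and target means, the toy reward, and the sample size are illustrative assumptions, and the paper's active choice of behavioral policy is not reproduced here.

```python
# Minimal sketch (assumptions, not from the cited paper): importance-weighted
# likelihood-ratio policy gradient for a 1-D Gaussian policy, reusing actions
# collected under an older behavioral policy.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
mu_behav, mu_target, sigma = 0.0, 0.3, 1.0  # behavioral vs. current policy means

def reward(a):
    # toy reward, standing in for a return estimate
    return -(a - 1.0) ** 2

def log_prob(a, mu):
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# actions logged under the behavioral policy
a = rng.normal(mu_behav, sigma, n)

# importance weights pi_target(a) / pi_behav(a) re-weight the old samples
w = np.exp(log_prob(a, mu_target) - log_prob(a, mu_behav))

# score of the *target* policy w.r.t. its mean: d/d_mu log N(a; mu, sigma^2)
score = (a - mu_target) / sigma**2

# exact gradient here is -2 * (mu_target - 1) = +1.4
grad_offpolicy = np.mean(w * reward(a) * score)
print(f"off-policy IS gradient estimate: {grad_offpolicy:+.3f}")
```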
This list is automatically generated from the titles and abstracts of the papers on this site.