A Temporal-Difference Approach to Policy Gradient Estimation
- URL: http://arxiv.org/abs/2202.02396v1
- Date: Fri, 4 Feb 2022 21:23:33 GMT
- Title: A Temporal-Difference Approach to Policy Gradient Estimation
- Authors: Samuele Tosatto, Andrew Patterson, Martha White, A. Rupam Mahmood
- Abstract summary: We propose a new approach to reconstructing the policy gradient from the start state without requiring a particular sampling strategy.
By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way.
- Score: 27.749993205038148
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a
cumulative discounted state distribution under the target policy to approximate
the gradient. Most algorithms based on this theorem, in practice, break this
assumption, introducing a distribution shift that can cause the convergence to
poor solutions. In this paper, we propose a new approach to reconstructing the
policy gradient from the start state without requiring a particular sampling
strategy. The policy gradient calculation in this form can be simplified in
terms of a gradient critic, which can be recursively estimated due to a new
Bellman equation of gradients. By using temporal-difference updates of the
gradient critic from an off-policy data stream, we develop the first estimator
that sidesteps the distribution shift issue in a model-free way. We prove that,
under certain realizability conditions, our estimator is unbiased regardless of
the sampling strategy. We empirically show that our technique achieves a
superior bias-variance trade-off and performance in the presence of off-policy
samples.
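A quick way to make the abstract's objects concrete, using notation of my own rather than the paper's: differentiating the Bellman equation $Q^\pi(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[\sum_{a'} \pi(a'|s') Q^\pi(s',a')]$ with respect to the policy parameters $\theta$ gives a recursion for a gradient critic $W^\pi(s,a) := \nabla_\theta Q^\pi(s,a)$, namely $W^\pi(s,a) = \gamma\, \mathbb{E}_{s'\sim P(\cdot|s,a),\, a'\sim\pi}[\nabla_\theta \log\pi(a'|s')\, Q^\pi(s',a') + W^\pi(s',a')]$, which has the same fixed-point structure as an ordinary Bellman equation and can therefore be targeted with temporal-difference updates. The sketch below shows one such TD(0)-style update under linear function approximation; the helper names (phi, grad_logpi, q_hat, pi_sample) and the linear parameterization are illustrative assumptions, not the paper's implementation.

import numpy as np

# TD(0)-style update of a linear gradient critic, W(s, a) ~= Omega @ phi(s, a).
# All helpers are hypothetical stand-ins, not the paper's code:
#   phi(s, a)        -> feature vector, shape (d,)
#   grad_logpi(s, a) -> score grad_theta log pi(a|s) of the target policy, shape (p,)
#   q_hat(s, a)      -> scalar estimate of Q^pi(s, a) from an ordinary critic
#   pi_sample(s)     -> draws an action from the target policy pi(.|s)
def td_update_gradient_critic(Omega, s, a, s_next, phi, grad_logpi, q_hat,
                              pi_sample, gamma=0.99, alpha=1e-2):
    a_next = pi_sample(s_next)  # bootstrap action drawn under the target policy
    # Target from the gradient Bellman recursion:
    #   W(s,a) = gamma * E[ grad log pi(a'|s') * Q(s',a') + W(s',a') ]
    target = gamma * (grad_logpi(s_next, a_next) * q_hat(s_next, a_next)
                      + Omega @ phi(s_next, a_next))
    td_error = target - Omega @ phi(s, a)           # vector-valued TD error, shape (p,)
    Omega += alpha * np.outer(td_error, phi(s, a))  # semi-gradient update of the (p, d) matrix
    return Omega

In this sketch the transition (s, a, s_next) may come from an arbitrary off-policy data stream; only the bootstrap action is drawn from the target policy, which is the sense in which such an update avoids reweighting or resampling states under the target distribution.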
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance (a plain importance-sampling-weighted estimator is sketched after this list for reference).
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Towards Provable Log Density Policy Gradient [6.0891236991406945]
Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning.
In this work, we argue that this residual term is significant and correcting for it could potentially improve sample-complexity of reinforcement learning methods.
We propose log density gradient to estimate the policy gradient, which corrects for this residual error term.
arXiv Detail & Related papers (2024-03-03T20:09:09Z)
- Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback [22.21598324895312]
This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback control.
We derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods.
We provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when near such minima.
arXiv Detail & Related papers (2023-10-29T14:25:57Z)
- A Policy Gradient Method for Confounded POMDPs [7.75007282943125]
We propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting.
We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data.
arXiv Detail & Related papers (2023-05-26T16:48:05Z)
- The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
- Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)
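Several entries above (e.g. "Policy Gradient with Active Importance Sampling" and "Statistically Efficient Off-Policy Policy Gradients") treat importance sampling as the default way to reuse off-policy samples, which is the baseline the main paper's TD-based estimator is contrasted against. For reference, here is a minimal sketch of the plain trajectory-level IS-weighted (REINFORCE-style) policy gradient estimate; the helper names and structure are illustrative assumptions, not code from any of these papers.

import numpy as np

# Vanilla trajectory-level importance-sampling policy gradient estimate.
#   trajectories: list of [(s, a, r), ...] collected under a behavior policy b
#   grad_logpi(s, a) -> score grad_theta log pi(a|s) of the target policy, shape (p,)
#   logpi(s, a), logb(s, a) -> log-probabilities under the target / behavior policies
def is_policy_gradient(trajectories, grad_logpi, logpi, logb, gamma=0.99):
    estimates = []
    for traj in trajectories:
        # Trajectory importance weight: product of per-step probability ratios.
        log_ratio = sum(logpi(s, a) - logb(s, a) for s, a, _ in traj)
        weight = np.exp(log_ratio)
        # Discounted return and total score of the trajectory.
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        score = sum(grad_logpi(s, a) for s, a, _ in traj)
        estimates.append(weight * ret * score)
    return np.mean(estimates, axis=0)

Because the weight multiplies one probability ratio per step, its variance grows quickly with the horizon; reducing that variance, whether by choosing the behavior policy, adding baselines, or avoiding the ratios altogether as in the TD-based estimator sketched earlier, is the common thread of these papers.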
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.