A Temporal-Difference Approach to Policy Gradient Estimation
- URL: http://arxiv.org/abs/2202.02396v1
- Date: Fri, 4 Feb 2022 21:23:33 GMT
- Title: A Temporal-Difference Approach to Policy Gradient Estimation
- Authors: Samuele Tosatto, Andrew Patterson, Martha White, A. Rupam Mahmood
- Abstract summary: We propose a new approach to reconstructing the policy gradient from the start state without requiring a particular sampling strategy.
By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way.
- Score: 27.749993205038148
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a
cumulative discounted state distribution under the target policy to approximate
the gradient. Most algorithms based on this theorem, in practice, break this
assumption, introducing a distribution shift that can cause the convergence to
poor solutions. In this paper, we propose a new approach to reconstructing the
policy gradient from the start state without requiring a particular sampling
strategy. The policy gradient calculation in this form can be simplified in
terms of a gradient critic, which can be recursively estimated due to a new
Bellman equation of gradients. By using temporal-difference updates of the
gradient critic from an off-policy data stream, we develop the first estimator
that sidesteps the distribution shift issue in a model-free way. We prove that,
under certain realizability conditions, our estimator is unbiased regardless of
the sampling strategy. We empirically show that our technique achieves a
superior bias-variance trade-off and performance in the presence of off-policy
samples.
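A quick way to make the abstract's objects concrete, using notation of my own rather than the paper's: differentiating the Bellman equation $Q^\pi(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[\sum_{a'} \pi(a'|s') Q^\pi(s',a')]$ with respect to the policy parameters $\theta$ gives a recursion for a gradient critic $W^\pi(s,a) := \nabla_\theta Q^\pi(s,a)$, namely $W^\pi(s,a) = \gamma\, \mathbb{E}_{s'\sim P(\cdot|s,a),\, a'\sim\pi}[\nabla_\theta \log\pi(a'|s')\, Q^\pi(s',a') + W^\pi(s',a')]$, which has the same fixed-point structure as an ordinary Bellman equation and can therefore be targeted with temporal-difference updates. The sketch below shows one such TD(0)-style update under linear function approximation; the helper names (phi, grad_logpi, q_hat, pi_sample) and the linear parameterization are illustrative assumptions, not the paper's implementation.

import numpy as np

# TD(0)-style update of a linear gradient critic, W(s, a) ~= Omega @ phi(s, a).
# All helpers are hypothetical stand-ins, not the paper's code:
#   phi(s, a)        -> feature vector, shape (d,)
#   grad_logpi(s, a) -> score grad_theta log pi(a|s) of the target policy, shape (p,)
#   q_hat(s, a)      -> scalar estimate of Q^pi(s, a) from an ordinary critic
#   pi_sample(s)     -> draws an action from the target policy pi(.|s)
def td_update_gradient_critic(Omega, s, a, s_next, phi, grad_logpi, q_hat,
                              pi_sample, gamma=0.99, alpha=1e-2):
    a_next = pi_sample(s_next)  # bootstrap action drawn under the target policy
    # Target from the gradient Bellman recursion:
    #   W(s,a) = gamma * E[ grad log pi(a'|s') * Q(s',a') + W(s',a') ]
    target = gamma * (grad_logpi(s_next, a_next) * q_hat(s_next, a_next)
                      + Omega @ phi(s_next, a_next))
    td_error = target - Omega @ phi(s, a)           # vector-valued TD error, shape (p,)
    Omega += alpha * np.outer(td_error, phi(s, a))  # semi-gradient update of the (p, d) matrix
    return Omega

In this sketch the transition (s, a, s_next) may come from an arbitrary off-policy data stream; only the bootstrap action is drawn from the target policy, which is the sense in which such an update avoids reweighting or resampling states under the target distribution.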
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance (a plain importance-sampling-weighted estimator is sketched after this list for reference).
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Towards Provable Log Density Policy Gradient [6.0891236991406945]
Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning.
In this work, we argue that this residual term is significant and correcting for it could potentially improve sample-complexity of reinforcement learning methods.
We propose log density gradient to estimate the policy gradient, which corrects for this residual error term.
arXiv Detail & Related papers (2024-03-03T20:09:09Z)
- Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback [22.21598324895312]
This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback control.
We derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods.
We provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when near such minima.
arXiv Detail & Related papers (2023-10-29T14:25:57Z)
- A Policy Gradient Method for Confounded POMDPs [7.75007282943125]
We propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting.
We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data.
arXiv Detail & Related papers (2023-05-26T16:48:05Z)
- The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
- Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)
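Several entries above (e.g. "Policy Gradient with Active Importance Sampling" and "Statistically Efficient Off-Policy Policy Gradients") treat importance sampling as the default way to reuse off-policy samples, which is the baseline the main paper's TD-based estimator is contrasted against. For reference, here is a minimal sketch of the plain trajectory-level IS-weighted (REINFORCE-style) policy gradient estimate; the helper names and structure are illustrative assumptions, not code from any of these papers.

import numpy as np

# Vanilla trajectory-level importance-sampling policy gradient estimate.
#   trajectories: list of [(s, a, r), ...] collected under a behavior policy b
#   grad_logpi(s, a) -> score grad_theta log pi(a|s) of the target policy, shape (p,)
#   logpi(s, a), logb(s, a) -> log-probabilities under the target / behavior policies
def is_policy_gradient(trajectories, grad_logpi, logpi, logb, gamma=0.99):
    estimates = []
    for traj in trajectories:
        # Trajectory importance weight: product of per-step probability ratios.
        log_ratio = sum(logpi(s, a) - logb(s, a) for s, a, _ in traj)
        weight = np.exp(log_ratio)
        # Discounted return and total score of the trajectory.
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        score = sum(grad_logpi(s, a) for s, a, _ in traj)
        estimates.append(weight * ret * score)
    return np.mean(estimates, axis=0)

Because the weight multiplies one probability ratio per step, its variance grows quickly with the horizon; reducing that variance, whether by choosing the behavior policy, adding baselines, or avoiding the ratios altogether as in the TD-based estimator sketched earlier, is the common thread of these papers.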
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.