A Policy Gradient Method for Confounded POMDPs
- URL: http://arxiv.org/abs/2305.17083v2
- Date: Fri, 1 Dec 2023 02:21:35 GMT
- Title: A Policy Gradient Method for Confounded POMDPs
- Authors: Mao Hong, Zhengling Qi, Yanxun Xu
- Abstract summary: We propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting.
We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data.
- Score: 7.75007282943125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a policy gradient method for confounded partially
observable Markov decision processes (POMDPs) with continuous state and
observation spaces in the offline setting. We first establish a novel
identification result to non-parametrically estimate any history-dependent
policy gradient under POMDPs using the offline data. The identification enables
us to solve a sequence of conditional moment restrictions and adopt the min-max
learning procedure with general function approximation for estimating the
policy gradient. We then provide a finite-sample non-asymptotic bound for
estimating the gradient uniformly over a pre-specified policy class in terms of
the sample size, length of horizon, concentrability coefficient and the
measure of ill-posedness in solving the conditional moment restrictions.
Lastly, by deploying the proposed gradient estimation in the gradient ascent
algorithm, we show the global convergence of the proposed algorithm in finding
the history-dependent optimal policy under some technical conditions. To the
best of our knowledge, this is the first work studying the policy gradient
method for POMDPs under the offline setting.
Related papers
- Strongly-polynomial time and validation analysis of policy gradient methods [3.722665817361884]
This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDPs) and reinforcement learning (RL).
By incorporating this advantage gap function into the design of step size rules, we derive a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy.
This is the first time that such strong convergence properties have been established for policy gradient methods.
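As a rough illustration of an advantage-based stopping test (not the paper's exact advantage gap function or step-size rules), the sketch below evaluates a policy exactly on a toy finite MDP and reports the largest advantage; a small value certifies approximate optimality of the current policy.

```python
import numpy as np

def advantage_gap(P, r, pi, gamma=0.9):
    """Exact policy evaluation on a finite MDP, then the largest advantage.

    P: (S, A, S) transition tensor, r: (S, A) rewards, pi: (S, A) policy.
    """
    S, A = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)                 # state transitions under pi
    r_pi = (pi * r).sum(axis=1)                           # expected reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi
    q = r + gamma * np.einsum("sat,t->sa", P, v)          # Q^pi
    return (q - v[:, None]).max()                         # max_{s,a} A^pi(s, a)

# Toy 2-state, 2-action MDP; one would terminate training once the gap is tiny.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(2), size=(2, 2))                # shape (2, 2, 2)
r = rng.uniform(size=(2, 2))
uniform = np.full((2, 2), 0.5)
print("advantage gap of uniform policy:", advantage_gap(P, r, uniform))
```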
arXiv Detail & Related papers (2024-09-28T18:56:48Z)
- Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
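A bare-bones primal-dual loop of this flavor, on a hypothetical one-state problem with a single deterministic action standing in for a full policy, might look as follows; D-PGPD itself adds regularization and handles general continuous-space constrained MDPs.

```python
import numpy as np

# Toy constrained problem: maximize r(a) subject to c(a) <= b over a scalar action.
r = lambda a: -(a - 2.0) ** 2          # reward, peaked at a = 2
c = lambda a: a ** 2                   # cost
b = 1.0                                # budget; feasible actions satisfy a^2 <= 1

a, lam = 0.0, 0.0                      # deterministic action ("policy") and dual variable
eta_a, eta_lam = 0.05, 0.05
for _ in range(2000):
    grad_a = -2.0 * (a - 2.0) - lam * 2.0 * a    # d/da [ r(a) - lam * c(a) ]
    a += eta_a * grad_a                          # primal ascent on the Lagrangian
    lam = max(0.0, lam + eta_lam * (c(a) - b))   # dual ascent, projected to lam >= 0
print(f"action {a:.3f}, multiplier {lam:.3f}, cost {c(a):.3f} (budget {b})")
```

The loop settles near the constrained optimum a = 1 with multiplier lam = 1, the saddle point of the Lagrangian.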
arXiv Detail & Related papers (2024-08-19T14:11:04Z)
- Elementary Analysis of Policy Gradient Methods [3.468656086349638]
This paper focuses on the discounted MDP setting and conducts a systematic study of the aforementioned policy optimization methods.
Several novel results are presented, including 1) global linear convergence of projected policy gradient for any constant step size, 2) sublinear convergence of softmax policy gradient for any constant step size, 3) global linear convergence of softmax natural policy gradient for any constant step size, 4) global linear convergence of entropy-regularized softmax policy gradient for a wider range of constant step sizes than existing results, 5) a tight local linear convergence rate of entropy-regularized natural policy gradient, and 6) a new and concise local quadratic convergence rate of soft policy iteration.
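For concreteness, here is the simplest instance of one iteration family from this list, softmax policy gradient with a constant step size, on a bandit with exact gradients; the toy is ours, not the paper's.

```python
import numpy as np

# Exact softmax policy gradient with a constant step size on a 3-armed bandit.
r = np.array([1.0, 0.8, 0.2])          # expected reward per arm
theta = np.zeros(3)                    # softmax logits
eta = 1.0                              # constant step size
for t in range(500):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    v = pi @ r                         # value of the current policy
    theta += eta * pi * (r - v)        # exact gradient: dV/dtheta_a = pi_a (r_a - V)
print("final policy:", np.round(pi, 4))  # concentrates on the best arm
```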
arXiv Detail & Related papers (2024-04-04T11:16:16Z)
- Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback [22.21598324895312]
This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback control.
We derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods.
We prove that the vanilla policy gradient method converges linearly to a local minimum when initialized near it.
arXiv Detail & Related papers (2023-10-29T14:25:57Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy for an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
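The classic algorithm in this setting is TD(0) with linear features; below is a minimal sketch on a toy Markov chain of our own construction, showing the iteration whose sample complexity this line of work quantifies.

```python
import numpy as np

# Vanilla TD(0) with linear function approximation on a 3-state chain.
rng = np.random.default_rng(2)
phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])   # features, one row per state
P = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]])
r = np.array([0.0, 0.5, 1.0])
gamma, alpha = 0.9, 0.05

w = np.zeros(2)
s = 0
for t in range(50_000):
    s_next = rng.choice(3, p=P[s])
    td_error = r[s] + gamma * phi[s_next] @ w - phi[s] @ w
    w += alpha * td_error * phi[s]     # stochastic semi-gradient step
    s = s_next
print("learned weights:", np.round(w, 3))
```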
arXiv Detail & Related papers (2023-05-30T12:58:39Z)
- On Linear Convergence of Policy Gradient Methods for Finite MDPs [8.00114449574708]
We revisit the finite-time analysis of policy gradient methods in one of the simplest settings.
We show that many variants of policy gradient methods succeed with large step-sizes and attain a linear rate of convergence.
arXiv Detail & Related papers (2020-07-21T22:35:37Z)
- Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
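The core idea, shown in one dimension below under illustrative choices (RBF kernel, Gaussian integration measure, a made-up integrand), is to fit a Gaussian process to the integrand and integrate its posterior mean in closed form; DBQPG lifts this to the high-dimensional policy gradient integral.

```python
import numpy as np

# Bayesian quadrature: estimate E_{x~N(0,1)}[f(x)] from a few samples.
rng = np.random.default_rng(3)
f = lambda x: np.sin(x) + x ** 2       # integrand; true mean is 1 (E[sin]=0, E[x^2]=1)
ell, sig_p, jitter = 0.7, 1.0, 1e-8

x = rng.normal(size=15)
y = f(x)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))   # RBF Gram matrix
# Kernel mean embedding of N(0, sig_p^2) under the RBF kernel (closed form in 1-D).
s2 = ell ** 2 + sig_p ** 2
z = np.sqrt(ell ** 2 / s2) * np.exp(-x ** 2 / (2 * s2))
estimate = z @ np.linalg.solve(K + jitter * np.eye(len(x)), y)
print(f"BQ estimate: {estimate:.3f}  (true value: 1.000)")
```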
arXiv Detail & Related papers (2020-06-28T15:44:47Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
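The failure mode and the kernel fix are easy to see in a toy problem: a deterministic target policy puts a point mass on one action, so the density ratio against a stochastic behavior policy does not exist, and a bandwidth-h kernel around the target action restores a usable weight. The sketch below shows plain kernelized importance sampling on made-up data (our simplification); the paper's estimators add a doubly robust correction on top.

```python
import numpy as np

rng = np.random.default_rng(4)
n, h = 20_000, 0.1
s = rng.uniform(-1, 1, size=n)                      # states
a = rng.normal(size=n)                              # behavior policy: a ~ N(0, 1)
r = -(a - s) ** 2 + rng.normal(scale=0.1, size=n)   # reward, maximized at a = s

pi = lambda s: s                                    # deterministic target policy a = s
q = lambda a: np.exp(-a ** 2 / 2) / np.sqrt(2 * np.pi)   # behavior density

# Kernel of bandwidth h replaces the nonexistent density ratio pi/q.
K_h = np.exp(-(a - pi(s)) ** 2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)
v_hat = np.mean(K_h * r / q(a))                     # kernelized IS estimate of V(pi)
print(f"estimated value: {v_hat:.3f}  (true value 0, up to O(h^2) kernel bias)")
```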
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)