First-order Policy Optimization for Robust Markov Decision Process
- URL: http://arxiv.org/abs/2209.10579v2
- Date: Sat, 10 Jun 2023 21:34:45 GMT
- Title: First-order Policy Optimization for Robust Markov Decision Process
- Authors: Yan Li, Guanghui Lan, Tuo Zhao
- Abstract summary: We consider the problem of solving a robust Markov decision process (MDP), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
- Score: 40.2022466644885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of solving a robust Markov decision process (MDP),
which involves a set of discounted, finite state, finite action space MDPs with
uncertain transition kernels. The goal of planning is to find a robust policy
that optimizes the worst-case values against the transition uncertainties, and
thus encompasses the standard MDP planning as a special case. For
$(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several
structural observations on the robust objective, which facilitate the
development of a policy-based first-order method, namely the robust policy
mirror descent (RPMD). An $\mathcal{O}(\log(1/\epsilon))$ iteration complexity
for finding an $\epsilon$-optimal policy is established with linearly
increasing stepsizes. We further develop a stochastic variant of the robust
policy mirror descent method, named SRPMD, when the first-order information is
only available through online interactions with the nominal environment. We
show that the optimality gap converges linearly up to the noise level, and
consequently establish an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity
by developing a temporal difference learning method for policy evaluation. Both
iteration and sample complexities are also discussed for RPMD with a constant
stepsize. To the best of our knowledge, all the aforementioned results appear
to be new for policy-based first-order methods applied to the robust MDP
problem.
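As a concrete reading of the method described in the abstract, the sketch below implements RPMD with a KL Bregman divergence, so the per-state mirror step becomes a multiplicative-weights update, and evaluates the robust value function by iterating the worst-case Bellman operator. The choice of uncertainty set (a finite family of candidate transition vectors per state-action pair), the cost-minimization convention, and the exact stepsize schedule are illustrative assumptions, not the paper's specification.

```python
# A minimal sketch of robust policy mirror descent (RPMD), for illustration
# only.  Assumptions not taken from the paper: the uncertainty set is a
# finite family of candidate transition vectors per (s, a) pair, costs are
# minimized, and the stepsize schedule is eta_k = eta0 * (k + 1).
import numpy as np


def robust_policy_evaluation(policy, cost, kernels, gamma, tol=1e-8):
    """Iterate the worst-case Bellman operator to a fixed point.

    kernels[s][a] is an (m, S) array of candidate transition vectors for the
    pair (s, a); (s, a)-rectangularity lets the adversary pick the worst
    (largest-cost) candidate independently for every pair.
    """
    S, A = cost.shape
    V = np.zeros(S)
    while True:
        Q = np.array([[cost[s, a] + gamma * np.max(kernels[s][a] @ V)
                       for a in range(A)] for s in range(S)])
        V_new = np.sum(policy * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new


def rpmd(cost, kernels, gamma, iters=100, eta0=1.0):
    """RPMD with a KL Bregman divergence and linearly increasing stepsizes."""
    S, A = cost.shape
    log_pi = np.full((S, A), -np.log(A))      # log of the uniform initial policy
    for k in range(iters):
        _, Q = robust_policy_evaluation(np.exp(log_pi), cost, kernels, gamma)
        eta = eta0 * (k + 1)                  # linearly increasing stepsize
        # KL mirror step: pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(-eta * Q(s, a))
        log_pi = log_pi - eta * Q
        log_pi -= log_pi.max(axis=1, keepdims=True)                   # stability
        log_pi -= np.log(np.exp(log_pi).sum(axis=1, keepdims=True))   # normalize
    return np.exp(log_pi)


if __name__ == "__main__":
    # Toy usage: 2 states, 2 actions, 3 candidate transition vectors per pair.
    rng = np.random.default_rng(0)
    S, A, m = 2, 2, 3
    kernels = [[rng.dirichlet(np.ones(S), size=m) for _ in range(A)]
               for _ in range(S)]
    cost = rng.random((S, A))
    print(rpmd(cost, kernels, gamma=0.9))
```

With the KL divergence, each iteration reduces to $\pi_{k+1}(a|s) \propto \pi_k(a|s)\exp(-\eta_k Q^{\pi_k}_{\mathrm{rob}}(s,a))$, so only the robust Q-values of the current policy are needed; $(\mathbf{s},\mathbf{a})$-rectangularity is what lets the adversary's inner maximization decouple across state-action pairs.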
Related papers
- Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form [26.01796404477275]
This paper presents the first algorithm capable of identifying a near-optimal policy in a robust constrained MDP (RCMDP).
An optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments.
arXiv Detail & Related papers (2024-08-29T06:37:16Z) - Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - First-order Policy Optimization for Robust Policy Evaluation [10.772560347950053]
We adopt a policy optimization viewpoint towards policy evaluation for robust Markov decision process with $\mathrm{s}$-rectangular ambiguity sets.
The developed method, named first-order robust policy evaluation (FRPE), provides the first unified framework for robust policy evaluation in both deterministic (offline) and linear (online) settings.
arXiv Detail & Related papers (2023-07-29T05:22:43Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - A Theoretical Analysis of Optimistic Proximal Policy Optimization in
Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $\texttt{s}$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)
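For concreteness, the exponential-time result in the last entry refers to the exact softmax policy gradient update on a tabular MDP; a minimal sketch of one such update is given below so the quantities $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ have a concrete object to attach to. The uniform start distribution and the step size are assumptions made for illustration, not choices from that paper.

```python
# One exact softmax policy gradient step on a tabular MDP (illustration only;
# the uniform start distribution rho and the step size eta are assumed).
import numpy as np


def softmax_pg_step(theta, P, r, gamma, eta):
    """theta: (S, A) logits; P: (S, A, S) transition tensor; r: (S, A) rewards."""
    S, A = theta.shape
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)                    # softmax policy
    P_pi = np.einsum('sa,sap->sp', pi, P)                  # induced state kernel
    r_pi = np.sum(pi * r, axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)    # exact policy evaluation
    Q = r + gamma * np.einsum('sap,p->sa', P, V)
    rho = np.full(S, 1.0 / S)                              # assumed start distribution
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    # Gradient of V^pi(rho) w.r.t. the logits under the softmax parameterization.
    grad = d[:, None] * pi * (Q - V[:, None]) / (1 - gamma)
    return theta + eta * grad
```

Iterating this step is the method whose worst-case convergence time that paper lower-bounds in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$.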