First-order Policy Optimization for Robust Markov Decision Process
- URL: http://arxiv.org/abs/2209.10579v2
- Date: Sat, 10 Jun 2023 21:34:45 GMT
- Title: First-order Policy Optimization for Robust Markov Decision Process
- Authors: Yan Li, Guanghui Lan, Tuo Zhao
- Abstract summary: We consider the problem of solving a robust Markov decision process (MDP),
which involves a set of discounted, finite-state, finite-action-space MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
- Score: 40.2022466644885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of solving a robust Markov decision process (MDP),
which involves a set of discounted, finite-state, finite-action-space MDPs with
uncertain transition kernels. The goal of planning is to find a robust policy
that optimizes the worst-case values against the transition uncertainties, and
thus encompasses standard MDP planning as a special case. For
$(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several
structural observations on the robust objective, which facilitate the
development of a policy-based first-order method, namely the robust policy
mirror descent (RPMD). An $\mathcal{O}(\log(1/\epsilon))$ iteration complexity
for finding an $\epsilon$-optimal policy is established with linearly
increasing stepsizes. We further develop a stochastic variant of the robust
policy mirror descent method, named SRPMD, when the first-order information is
only available through online interactions with the nominal environment. We
show that the optimality gap converges linearly up to the noise level, and
consequently establish an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity
by developing a temporal difference learning method for policy evaluation. Both
iteration and sample complexities are also discussed for RPMD with a constant
stepsize. To the best of our knowledge, all the aforementioned results appear
to be new for policy-based first-order methods applied to the robust MDP
problem.
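The abstract's core iteration can be sketched concretely. The following is a minimal, illustrative NumPy sketch of a robust policy mirror descent step with the KL divergence: evaluate the policy's Q-function under each candidate transition kernel, take the pointwise worst case, then apply the multiplicative (exponentiated-gradient) update with an increasing stepsize. The finite list of candidate kernels, the toy problem sizes, and the stepsize schedule are all assumptions for illustration, not the paper's exact $(\mathbf{s},\mathbf{a})$-rectangular construction or analysis.

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """Exact value and Q-function of policy pi under kernel P.
    P: (S, A, S) transition probabilities, R: (S, A) rewards, pi: (S, A)."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    q = R + gamma * np.einsum('sat,t->sa', P, v)
    return v, q

def rpmd_step(pi, kernels, R, gamma, eta):
    """One KL-mirror-descent step: pi'(a|s) proportional to pi(a|s) * exp(eta * Q_rob(s, a)).
    The pointwise min over a finite kernel list is an illustrative stand-in
    for the paper's (s,a)-rectangular uncertainty set."""
    q_rob = np.min([policy_value(P, R, pi, gamma)[1] for P in kernels], axis=0)
    new = pi * np.exp(eta * q_rob)
    return new / new.sum(axis=1, keepdims=True)

# Toy instance: 2 states, 2 actions, 2 candidate transition kernels.
rng = np.random.default_rng(0)
def random_kernel():
    P = rng.random((2, 2, 2))
    return P / P.sum(axis=2, keepdims=True)

kernels = [random_kernel() for _ in range(2)]
R = rng.random((2, 2))
gamma = 0.9
pi = np.full((2, 2), 0.5)                   # start from the uniform policy
for k in range(50):
    # Linearly increasing stepsizes, mirroring the schedule behind the
    # O(log(1/eps)) iteration complexity stated in the abstract.
    pi = rpmd_step(pi, kernels, R, gamma, eta=0.5 * (k + 1))
```

The stochastic variant (SRPMD) would replace the exact `policy_value` call with a temporal-difference estimate gathered from online interaction, which is where the $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity enters.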
Related papers
- First-order Policy Optimization for Robust Policy Evaluation [10.772560347950053]
We adopt a policy optimization viewpoint towards policy evaluation for robust Markov decision processes with $\mathrm{s}$-rectangular ambiguity sets.
The developed method, named first-order policy evaluation (FRPE), provides the first unified framework for robust policy evaluation in both deterministic (offline) and linear (online) settings.
arXiv Detail & Related papers (2023-07-29T05:22:43Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - A Theoretical Analysis of Optimistic Proximal Policy Optimization in
Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z) - Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted
Markov Decision Processes [18.35462792871242]
Policy Mirror Descent is a family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning.
Motivated by the instability of policy iteration with inexact policy evaluation, PMD algorithmically regularises the policy improvement step of PI.
We show that the dimension-free $\gamma$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size.
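The relationship between PMD and policy iteration described above can be sketched in a few lines: the KL-regularised PMD update is a softmax over Q-values, and as the stepsize grows it approaches PI's greedy improvement step. The Q-values below are hypothetical numbers chosen purely for illustration.

```python
import numpy as np

q = np.array([[1.0, 2.0], [0.5, 0.4]])   # hypothetical Q-values, 2 states x 2 actions
pi = np.full((2, 2), 0.5)                 # uniform starting policy

def pmd_update(pi, q, eta):
    """KL-regularised PMD improvement: pi'(a|s) proportional to pi(a|s) * exp(eta * q(s, a))."""
    new = pi * np.exp(eta * q)
    return new / new.sum(axis=1, keepdims=True)

soft = pmd_update(pi, q, 1.0)             # small stepsize: a gentle, regularised step
greedy_like = pmd_update(pi, q, 100.0)    # large stepsize: nearly the PI greedy step
```

With a large stepsize almost all probability mass lands on `argmax_a q(s, a)`, recovering PI's policy improvement; the adaptive step-size in the paper above interpolates between these regimes to obtain the $\gamma$-rate.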
arXiv Detail & Related papers (2023-02-22T13:55:08Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $\texttt{s}$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Stochastic first-order methods for average-reward Markov decision
processes [10.483316336206903]
We study the problem of average-reward Markov decision processes (AMDPs)
We develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization.
arXiv Detail & Related papers (2022-05-11T23:02:46Z) - Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs)
A variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in the general case and obtain a convergence-rate estimate in which errors do not accumulate as the method runs.
arXiv Detail & Related papers (2021-02-27T19:28:39Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.