First-order Policy Optimization for Robust Markov Decision Process
- URL: http://arxiv.org/abs/2209.10579v2
- Date: Sat, 10 Jun 2023 21:34:45 GMT
- Title: First-order Policy Optimization for Robust Markov Decision Process
- Authors: Yan Li, Guanghui Lan, Tuo Zhao
- Abstract summary: We consider the problem of solving a robust Markov decision process (MDP),
which involves a set of discounted, finite-state, finite-action-space MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
- Score: 40.2022466644885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of solving a robust Markov decision process (MDP),
which involves a set of discounted, finite-state, finite-action-space MDPs with
uncertain transition kernels. The goal of planning is to find a robust policy
that optimizes the worst-case values against the transition uncertainties, and
thus encompasses standard MDP planning as a special case. For
$(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several
structural observations on the robust objective, which facilitate the
development of a policy-based first-order method, namely the robust policy
mirror descent (RPMD). An $\mathcal{O}(\log(1/\epsilon))$ iteration complexity
for finding an $\epsilon$-optimal policy is established with linearly
increasing stepsizes. We further develop a stochastic variant of the robust
policy mirror descent method, named SRPMD, when the first-order information is
only available through online interactions with the nominal environment. We
show that the optimality gap converges linearly up to the noise level, and
consequently establish an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity
by developing a temporal difference learning method for policy evaluation. Both
iteration and sample complexities are also discussed for RPMD with a constant
stepsize. To the best of our knowledge, all the aforementioned results appear
to be new for policy-based first-order methods applied to the robust MDP
problem.
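The abstract's core iteration can be sketched concretely. The following is a minimal, illustrative NumPy sketch of a robust policy mirror descent step with the KL divergence: evaluate the policy's Q-function under each candidate transition kernel, take the pointwise worst case, then apply the multiplicative (exponentiated-gradient) update with an increasing stepsize. The finite list of candidate kernels, the toy problem sizes, and the stepsize schedule are all assumptions for illustration, not the paper's exact $(\mathbf{s},\mathbf{a})$-rectangular construction or analysis.

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """Exact value and Q-function of policy pi under kernel P.
    P: (S, A, S) transition probabilities, R: (S, A) rewards, pi: (S, A)."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    q = R + gamma * np.einsum('sat,t->sa', P, v)
    return v, q

def rpmd_step(pi, kernels, R, gamma, eta):
    """One KL-mirror-descent step: pi'(a|s) proportional to pi(a|s) * exp(eta * Q_rob(s, a)).
    The pointwise min over a finite kernel list is an illustrative stand-in
    for the paper's (s,a)-rectangular uncertainty set."""
    q_rob = np.min([policy_value(P, R, pi, gamma)[1] for P in kernels], axis=0)
    new = pi * np.exp(eta * q_rob)
    return new / new.sum(axis=1, keepdims=True)

# Toy instance: 2 states, 2 actions, 2 candidate transition kernels.
rng = np.random.default_rng(0)
def random_kernel():
    P = rng.random((2, 2, 2))
    return P / P.sum(axis=2, keepdims=True)

kernels = [random_kernel() for _ in range(2)]
R = rng.random((2, 2))
gamma = 0.9
pi = np.full((2, 2), 0.5)                   # start from the uniform policy
for k in range(50):
    # Linearly increasing stepsizes, mirroring the schedule behind the
    # O(log(1/eps)) iteration complexity stated in the abstract.
    pi = rpmd_step(pi, kernels, R, gamma, eta=0.5 * (k + 1))
```

The stochastic variant (SRPMD) would replace the exact `policy_value` call with a temporal-difference estimate gathered from online interaction, which is where the $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity enters.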
Related papers
- First-order Policy Optimization for Robust Policy Evaluation [10.772560347950053]
We adopt a policy optimization viewpoint towards policy evaluation for robust Markov decision processes with $\mathrm{s}$-rectangular ambiguity sets.
The developed method, named first-order policy evaluation (FRPE), provides the first unified framework for robust policy evaluation in both deterministic (offline) and linear (online) settings.
arXiv Detail & Related papers (2023-07-29T05:22:43Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - A Theoretical Analysis of Optimistic Proximal Policy Optimization in
Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z) - Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted
Markov Decision Processes [18.35462792871242]
Policy Mirror Descent is a family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning.
Motivated by the instability of policy iteration with inexact policy evaluation, PMD algorithmically regularises the policy improvement step of PI.
We show that the dimension-free $\gamma$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size.
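The relationship between PMD and policy iteration described above can be sketched in a few lines: the KL-regularised PMD update is a softmax over Q-values, and as the stepsize grows it approaches PI's greedy improvement step. The Q-values below are hypothetical numbers chosen purely for illustration.

```python
import numpy as np

q = np.array([[1.0, 2.0], [0.5, 0.4]])   # hypothetical Q-values, 2 states x 2 actions
pi = np.full((2, 2), 0.5)                 # uniform starting policy

def pmd_update(pi, q, eta):
    """KL-regularised PMD improvement: pi'(a|s) proportional to pi(a|s) * exp(eta * q(s, a))."""
    new = pi * np.exp(eta * q)
    return new / new.sum(axis=1, keepdims=True)

soft = pmd_update(pi, q, 1.0)             # small stepsize: a gentle, regularised step
greedy_like = pmd_update(pi, q, 100.0)    # large stepsize: nearly the PI greedy step
```

With a large stepsize almost all probability mass lands on `argmax_a q(s, a)`, recovering PI's policy improvement; the adaptive step-size in the paper above interpolates between these regimes to obtain the $\gamma$-rate.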
arXiv Detail & Related papers (2023-02-22T13:55:08Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $\texttt{s}$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Stochastic first-order methods for average-reward Markov decision
processes [10.483316336206903]
We study the problem of average-reward Markov decision processes (AMDPs)
We develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization.
arXiv Detail & Related papers (2022-05-11T23:02:46Z) - Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs)
A variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in the general case and obtain a convergence-rate estimate in which errors do not accumulate as the method runs.
arXiv Detail & Related papers (2021-02-27T19:28:39Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.