Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization
- URL: http://arxiv.org/abs/2205.14327v1
- Date: Sat, 28 May 2022 04:05:20 GMT
- Title: Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization
- Authors: Navdeep Kumar, Kfir Levy, Kaixin Wang, Shie Mannor
- Abstract summary: Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between \texttt{s}-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
- Score: 49.05403412954533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust Markov decision processes (MDPs) provide a general framework to model
decision problems where the system dynamics are changing or only partially
known. Recent work established the equivalence between \texttt{s}-rectangular
$L_p$ robust MDPs and regularized MDPs, and derived a regularized policy
iteration scheme that enjoys the same level of efficiency as standard MDPs.
However, a clear understanding of the policy improvement step is still lacking. For
example, we know the greedy policy can be stochastic but have little insight into how
each action affects this greedy policy. In this work, we focus on the policy
improvement step and derive concrete forms for the greedy policy and the
optimal robust Bellman operators. We find that the greedy policy is closely
related to some combination of the top $k$ actions, which provides a novel
characterization of its stochasticity. The exact nature of the combination
depends on the shape of the uncertainty set. Furthermore, our results allow us
to efficiently compute the policy improvement step by a simple binary search,
without turning to an external optimization subroutine. Moreover, for $L_1,
L_2$, and $L_\infty$ robust MDPs, we can even get rid of the binary search and
evaluate the optimal robust Bellman operators exactly. Our work greatly extends
existing results on solving \texttt{s}-rectangular $L_p$ robust MDPs via
regularized policy iteration and can be readily adapted to sample-based
model-free algorithms.
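The two computational claims above, that the greedy policy mixes only the top $k$ actions and that the policy improvement step reduces to a binary search over a scalar threshold, can be made concrete with a short sketch. The normalization condition used below, $\sum_a (Q(s,a)-\tau)_+ = 1$ (a sparsemax-style projection), is an illustrative stand-in rather than the paper's exact condition, which depends on $p$ and on the radius of the uncertainty set.

```python
import numpy as np

def greedy_threshold_policy(q_values, budget=1.0, iters=60):
    """Binary-search the threshold tau such that sum_a max(q(a) - tau, 0) = budget.

    The resulting policy pi(a) proportional to (q(a) - tau)_+ is supported only on
    the top-k actions. The linear condition here is a sparsemax-style stand-in;
    it is not the paper's exact L_p condition.
    """
    q = np.asarray(q_values, dtype=float)
    lo, hi = q.min() - budget, q.max()          # g(lo) >= budget and g(hi) = 0
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        mass = np.maximum(q - tau, 0.0).sum()   # monotonically decreasing in tau
        if mass > budget:
            lo = tau
        else:
            hi = tau
    pi = np.maximum(q - 0.5 * (lo + hi), 0.0)
    return pi / pi.sum()

# Only the top few actions receive positive probability.
print(greedy_threshold_policy([1.0, 0.2, 0.9, -3.0]))
```

For $Q(s,\cdot) = (1.0, 0.2, 0.9, -3.0)$ the threshold settles at $\tau = 0.45$, so only the two best actions are played, with probabilities roughly $(0.55, 0, 0.45, 0)$; this is the "combination of the top $k$ actions" behaviour described in the abstract.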
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z)
- Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs).
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z)
- Twice Regularized Markov Decision Processes: The Equivalence between Robustness and Regularization [64.60253456266872]
Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics.
Regularized MDPs show more stability in policy learning without impairing time complexity.
Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees.
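For readers less familiar with regularized MDPs, the snippet below is a minimal sketch of one common regularized Bellman backup, the entropy-regularized (soft) operator. It is only meant to make the notion concrete; it is not the twice-regularized operator of the paper above, which additionally regularizes the value term.

```python
import numpy as np

def soft_bellman_backup(v, P, R, gamma=0.99, tau=0.1):
    """One sweep of an entropy-regularized (soft) Bellman operator.

    P: transition kernel of shape (S, A, S); R: rewards of shape (S, A).
    tau is the entropy temperature; tau -> 0 recovers the standard backup.
    """
    q = R + gamma * (P @ v)                     # Q-values, shape (S, A)
    q_max = q.max(axis=1, keepdims=True)
    # Soft maximum over actions: tau * log sum_a exp(q / tau), computed stably.
    v_new = (q_max + tau * np.log(np.exp((q - q_max) / tau).sum(axis=1, keepdims=True)))[:, 0]
    # The greedy policy of the regularized problem is a softmax of the Q-values.
    policy = np.exp((q - v_new[:, None]) / tau)
    return v_new, policy
```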
arXiv Detail & Related papers (2023-03-12T13:03:28Z)
- An Efficient Solution to s-Rectangular Robust Markov Decision Processes [49.05403412954533]
We present an efficient robust value iteration for s-rectangular robust Markov Decision Processes (MDPs).
We do so by deriving the optimal robust Bellman operator in concrete forms using our $L_p$ water filling lemma.
We unveil the exact form of the optimal policies, which turn out to be novel threshold policies with the probability of playing an action proportional to its advantage.
arXiv Detail & Related papers (2023-01-31T13:54:23Z)
- Policy Gradient for Rectangular Robust Markov Decision Processes [62.397882389472564]
We introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs).
Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent.
arXiv Detail & Related papers (2023-01-31T12:40:50Z)
- Robust Average-Reward Markov Decision Processes [25.125481838479256]
We focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set.
We take an approach that approximates average-reward MDPs using discounted MDPs.
We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value algorithm that provably finds its solution.
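A rough sketch of such a robust relative value iteration is given below, assuming for simplicity a finite, $(s,a)$-rectangular uncertainty set (a list of candidate transition kernels) and the usual unichain conditions; the paper itself handles more general uncertainty sets, so this is an illustration rather than the authors' algorithm.

```python
import numpy as np

def robust_relative_value_iteration(P_set, R, iters=500, tol=1e-8):
    """Relative value iteration with a worst-case Bellman backup.

    P_set: list of candidate transition kernels, each of shape (S, A, S);
           a finite set is used purely for illustration.
    R:     reward matrix of shape (S, A).
    Returns an estimate of the worst-case gain (average reward) and a bias
    (relative value) vector h, with state 0 used as the reference state.
    """
    S, A = R.shape
    h = np.zeros(S)
    gain = 0.0
    for _ in range(iters):
        # Worst case over the uncertainty set, independently for each (s, a).
        worst_next = np.min([P @ h for P in P_set], axis=0)   # shape (S, A)
        t_h = (R + worst_next).max(axis=1)                    # robust Bellman backup
        gain = t_h[0] - h[0]                                  # gain estimate at the reference state
        h_new = t_h - t_h[0]                                  # re-center to keep h bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return gain, h
```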
arXiv Detail & Related papers (2023-01-02T19:51:55Z)
- First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving a robust Markov decision process (MDP).
The robust MDP involves a set of discounted, finite-state, finite-action-space MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
arXiv Detail & Related papers (2022-09-21T18:10:28Z)
- Twice regularized MDPs and the equivalence between robustness and regularization [65.58188361659073]
We show that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs.
We generalize regularized MDPs to twice regularized MDPs.
arXiv Detail & Related papers (2021-10-12T18:33:45Z)
- Partial Policy Iteration for L1-Robust Markov Decision Processes [13.555107578858307]
This paper describes new efficient algorithms for solving the common class of robust MDPs.
We propose partial policy iteration, a new, efficient, flexible, and general policy iteration scheme for robust MDPs.
Our experimental results indicate that the proposed methods are many orders of magnitude faster than the state-of-the-art approach.
arXiv Detail & Related papers (2020-06-16T19:50:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.