Robust Average-Reward Markov Decision Processes
- URL: http://arxiv.org/abs/2301.00858v1
- Date: Mon, 2 Jan 2023 19:51:55 GMT
- Title: Robust Average-Reward Markov Decision Processes
- Authors: Yue Wang, Alvaro Velasquez, George Atia, Ashley Prater-Bennette,
Shaofeng Zou
- Abstract summary: We focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set.
We take an approach that approximates average-reward MDPs using discounted MDPs.
We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value iteration algorithm that provably finds its solution.
- Score: 25.125481838479256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In robust Markov decision processes (MDPs), the uncertainty in the transition
kernel is addressed by finding a policy that optimizes the worst-case
performance over an uncertainty set of MDPs. While much of the literature has
focused on discounted MDPs, robust average-reward MDPs remain largely
unexplored. In this paper, we focus on robust average-reward MDPs, where the
goal is to find a policy that optimizes the worst-case average reward over an
uncertainty set. We first take an approach that approximates average-reward
MDPs using discounted MDPs. We prove that the robust discounted value function
converges to the robust average-reward as the discount factor $\gamma$ goes to
$1$, and moreover, when $\gamma$ is large, any optimal policy of the robust
discounted MDP is also an optimal policy of the robust average-reward. We
further design a robust dynamic programming approach, and theoretically
characterize its convergence to the optimum. Then, we investigate robust
average-reward MDPs directly without using discounted MDPs as an intermediate
step. We derive the robust Bellman equation for robust average-reward MDPs,
prove that the optimal policy can be derived from its solution, and further
design a robust relative value iteration algorithm that provably finds its
solution, or equivalently, the optimal robust policy.
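To make the second, direct approach concrete, the sketch below illustrates the general idea of a relative value iteration built on a robust Bellman backup. It is a rough illustration only, not the authors' algorithm: it assumes an $(\mathbf{s},\mathbf{a})$-rectangular uncertainty set given by a small finite collection of candidate transition kernels (so the inner minimization is a plain minimum over candidates), and the function name `robust_relative_value_iteration` and the array shapes are invented for the example.

```python
import numpy as np

def robust_relative_value_iteration(rewards, kernels, n_iters=1000, tol=1e-8):
    """Sketch of robust relative value iteration for an average-reward MDP.

    rewards: (S, A) array of immediate rewards r(s, a).
    kernels: (K, S, A, S) array of K candidate transition kernels; the
             (s, a)-rectangular uncertainty set here is the finite set
             {kernels[k, s, a, :] : k} for each pair (s, a) -- an
             illustrative choice, not the uncertainty sets studied in the paper.
    """
    S, A = rewards.shape
    v = np.zeros(S)          # relative value function
    ref = 0                  # reference state used to normalize the values
    gain = 0.0
    for _ in range(n_iters):
        # Robust Bellman backup: worst-case expected next value, minimized
        # independently for each (s, a) pair ((s, a)-rectangularity).
        worst_next = (kernels @ v).min(axis=0)    # shape (S, A)
        q = rewards + worst_next
        tv = q.max(axis=1)                        # greedy maximization over actions
        gain = tv[ref]                            # running estimate of the optimal gain
        v_new = tv - gain                         # subtract the gain to keep values bounded
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = np.argmax(rewards + (kernels @ v).min(axis=0), axis=1)
    return gain, v, policy
```

With a single candidate kernel (K = 1) this reduces to standard relative value iteration; under the usual unichain-type assumptions the normalized iterates stabilize and the gain estimate approaches the optimal worst-case average reward.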
Related papers
- On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average-reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Policy Gradient Algorithms for Robust MDPs with Non-Rectangular
Uncertainty Sets [10.26382228865201]
We propose policy gradient algorithms for robust infinite-horizon Markov decision processes (MDPs) with non-rectangular uncertainty sets.
The corresponding robust MDPs cannot be solved with dynamic programming techniques and are in fact provably intractable.
We thus present the first complete solution scheme for robust MDPs with non-rectangular uncertainty sets offering global optimality guarantees.
arXiv Detail & Related papers (2023-05-30T13:02:25Z) - Twice Regularized Markov Decision Processes: The Equivalence between
Robustness and Regularization [64.60253456266872]
Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics.
Regularized MDPs show more stability in policy learning without impairing time complexity.
Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees.
arXiv Detail & Related papers (2023-03-12T13:03:28Z) - Global Algorithms for Mean-Variance Optimization in Markov Decision
Processes [8.601670707452083]
Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge caused by the failure of dynamic programming.
We propose a new approach to find the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP.
arXiv Detail & Related papers (2023-02-27T12:17:43Z) - ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints [36.16736392624796]
We introduce a new policy optimization algorithm with function approximation for constrained MDPs with the average criterion.
We develop basic sensitivity theory for average CMDPs, and then use the corresponding bounds in the design of the algorithm.
We show its superior empirical performance when compared to other state-of-the-art algorithms adapted for the ACMDPs.
arXiv Detail & Related papers (2023-02-02T00:23:36Z) - An Efficient Solution to s-Rectangular Robust Markov Decision Processes [49.05403412954533]
We present an efficient robust value iteration for $\texttt{s}$-rectangular robust Markov decision processes (MDPs).
We do so by deriving the optimal robust Bellman operator in concrete forms using our $L_p$ water filling lemma.
We unveil the exact form of the optimal policies, which turn out to be novel threshold policies with the probability of playing an action proportional to its advantage.
arXiv Detail & Related papers (2023-01-31T13:54:23Z) - First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving a robust Markov decision process (MDP).
The robust MDP involves a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
arXiv Detail & Related papers (2022-09-21T18:10:28Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $\texttt{s}$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Twice regularized MDPs and the equivalence between robustness and
regularization [65.58188361659073]
We show that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs.
We generalize regularized MDPs to twice regularized MDPs.
arXiv Detail & Related papers (2021-10-12T18:33:45Z) - Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of the Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z)