Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization
- URL: http://arxiv.org/abs/2309.01107v2
- Date: Mon, 12 Feb 2024 11:23:29 GMT
- Title: Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization
- Authors: Uri Gadot, Esther Derman, Navdeep Kumar, Maxence Mohamed Elfatihi,
Kfir Levy, Shie Mannor
- Abstract summary: In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set.
The so-called rectangularity condition, which structures the uncertainty set independently for each state, is solely motivated by computational concerns.
We introduce a policy-gradient method and prove its convergence.
- Score: 39.740287682191884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In robust Markov decision processes (RMDPs), it is assumed that the reward
and the transition dynamics lie in a given uncertainty set. By targeting
maximal return under the most adversarial model from that set, RMDPs address
performance sensitivity to misspecified environments. Yet, to preserve
computational tractability, the uncertainty set is traditionally independently
structured for each state. This so-called rectangularity condition is solely
motivated by computational concerns. As a result, it lacks a practical
incentive and may lead to overly conservative behavior. In this work, we study
coupled reward RMDPs where the transition kernel is fixed, but the reward
function lies within an $\alpha$-radius from a nominal one. We draw a direct
connection between this type of non-rectangular reward-RMDPs and applying
policy visitation frequency regularization. We introduce a policy-gradient
method and prove its convergence. Numerical experiments illustrate the learned
policy's robustness and its less conservative behavior when compared to
rectangular uncertainty.
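For intuition, the connection stated in the abstract can be made explicit with a standard duality step (a sketch; the norm defining the $\alpha$-radius is not specified in this abstract, so a generic norm with dual $\|\cdot\|_*$ is assumed). Writing $d_\pi$ for the discounted state-action visitation frequency of a policy $\pi$ and $r_0$ for the nominal reward, the worst-case return over the reward ball reduces to a regularized nominal return:
$$\min_{\|r - r_0\| \le \alpha} \langle d_\pi, r \rangle \;=\; \langle d_\pi, r_0 \rangle \;-\; \alpha \,\| d_\pi \|_*,$$
so maximizing the robust return over $\pi$ amounts to maximizing the nominal return penalized by the dual norm of the visitation frequency, i.e., frequency regularization.
The sketch below checks this numerically with plain gradient ascent on the regularized objective. It is an illustration under stated assumptions (a random toy MDP, an $\ell_2$ reward ball, a softmax policy, and finite-difference gradients), not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy tabular MDP, for illustration only.
S, A, gamma, alpha = 3, 2, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: fixed transition kernel
r0 = rng.uniform(size=(S, A))                # nominal reward r_0(s, a)
mu = np.ones(S) / S                          # initial state distribution

def policy(theta):
    """Softmax policy pi(a|s) from logits theta[s, a]."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def occupancy(pi):
    """Normalized discounted state-action visitation frequency d_pi(s, a)."""
    P_pi = np.einsum('sa,sap->sp', pi, P)    # state-to-state kernel under pi
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu)
    return d_s[:, None] * pi

def robust_return(theta):
    """Worst-case return over the l2 reward ball: <d, r0> - alpha * ||d||_2."""
    d = occupancy(policy(theta))
    return float((d * r0).sum() - alpha * np.linalg.norm(d))

def fd_grad(f, theta, eps=1e-5):
    """Finite-difference gradient, for illustration only (the paper derives an exact policy gradient)."""
    g = np.zeros_like(theta)
    base = f(theta)
    for idx in np.ndindex(*theta.shape):
        t = theta.copy()
        t[idx] += eps
        g[idx] = (f(t) - base) / eps
    return g

theta = np.zeros((S, A))
for _ in range(200):                         # plain gradient ascent on the regularized objective
    theta += 0.5 * fd_grad(robust_return, theta)
print("regularized (worst-case) return:", robust_return(theta))
```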
Related papers
- Simplification of Risk Averse POMDPs with Performance Guarantees [6.129902017281406]
Risk averse decision making under uncertainty in partially observable domains is a fundamental problem in AI and essential for reliable autonomous agents.
In our case, the problem is modeled using partially observable Markov decision processes (POMDPs), where the value function is the conditional value at risk (CVaR) of the return.
Calculating an optimal solution for POMDPs is computationally intractable in general.
We develop a simplification framework to speed up the evaluation of the value function while providing performance guarantees.
arXiv Detail & Related papers (2024-06-05T07:05:52Z) - Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes [44.974100402600165]
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP).
We use transition observations from the original MDP, whether they are generated under the same or a different policy.
Our estimator is also amenable to statistical inference using Wald confidence intervals.
arXiv Detail & Related papers (2024-03-29T18:11:49Z) - On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Off-Policy Evaluation in Markov Decision Processes under Weak
Distributional Overlap [5.0401589279256065]
We re-visit the task of off-policy evaluation in Markov decision processes (MDPs) under a weaker notion of distributional overlap.
We introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting.
arXiv Detail & Related papers (2024-02-13T03:55:56Z) - Solving Long-run Average Reward Robust MDPs via Stochastic Games [6.183091173390457]
Robust Markov decision processes (RMDPs) assign to each transition an uncertainty set rather than a single probability value.
We show that solving long-run average reward RMDPs can be reduced to solving long-run average reward turn-based stochastic games with finite state and action spaces.
We present Robust Polytopic Policy Iteration (RPPI), a novel policy iteration algorithm for solving long-run average reward polytopic RMDPs.
arXiv Detail & Related papers (2023-12-21T15:00:06Z) - The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model [61.87673435273466]
This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice.
We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP.
arXiv Detail & Related papers (2023-05-26T02:32:03Z) - Twice Regularized Markov Decision Processes: The Equivalence between
Robustness and Regularization [64.60253456266872]
Robust Markov decision processes (RMDPs) aim to handle changing or partially known system dynamics.
Regularized MDPs show more stability in policy learning without impairing time complexity.
Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees.
arXiv Detail & Related papers (2023-03-12T13:03:28Z) - Policy Gradient for Rectangular Robust Markov Decision Processes [62.397882389472564]
We introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs)
Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent.
arXiv Detail & Related papers (2023-01-31T12:40:50Z) - Distributional Robustness and Regularization in Reinforcement Learning [62.23012916708608]
We introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function.
It suggests using regularization as a practical tool for dealing with $\textit{external uncertainty}$ in reinforcement learning.
arXiv Detail & Related papers (2020-03-05T19:56:23Z)