Bring Your Own (Non-Robust) Algorithm to Solve Robust MDPs by Estimating The Worst Kernel
- URL: http://arxiv.org/abs/2306.05859v2
- Date: Mon, 12 Feb 2024 11:19:09 GMT
- Title: Bring Your Own (Non-Robust) Algorithm to Solve Robust MDPs by Estimating The Worst Kernel
- Authors: Kaixin Wang, Uri Gadot, Navdeep Kumar, Kfir Levy, Shie Mannor
- Abstract summary: We present EWoK, a novel online approach to solving RMDPs that Estimates the Worst transition Kernel to learn robust policies.
EWoK achieves robustness by simulating the worst scenarios for the agent while retaining complete flexibility in the learning process.
Our experiments, spanning from simple Cartpole to high-dimensional DeepMind Control Suite environments, demonstrate the effectiveness and applicability of EWoK.
- Score: 46.373217780462944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust Markov Decision Processes (RMDPs) provide a framework for sequential
decision-making that is robust to perturbations on the transition kernel.
However, current RMDP methods are often limited to small-scale problems,
hindering their use in high-dimensional domains. To bridge this gap, we present
EWoK, a novel online approach to solving RMDPs that Estimates the Worst transition
Kernel to learn robust policies. Unlike previous works that regularize the
policy or value updates, EWoK achieves robustness by simulating the worst
scenarios for the agent while retaining complete flexibility in the learning
process. Notably, EWoK can be applied on top of any off-the-shelf {\em
non-robust} RL algorithm, enabling easy scaling to high-dimensional domains.
Our experiments, spanning from simple Cartpole to high-dimensional DeepMind
Control Suite environments, demonstrate the effectiveness and applicability of
the EWoK paradigm as a practical method for learning robust policies.
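The worst-kernel idea above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: `nominal_step`, `value_estimate`, and `n_candidates` are hypothetical stand-ins for the environment's transition model and a learned critic, and the argmin selection is a crude proxy for reweighting transitions toward worst-case outcomes.

```python
import random

def nominal_step(state, action, rng):
    """Sample a next state from the nominal (non-robust) transition kernel."""
    return state + action + rng.gauss(0.0, 1.0)

def value_estimate(state):
    """Stand-in critic: here, higher states are assumed more valuable."""
    return state

def worst_kernel_step(state, action, rng, n_candidates=8):
    """Approximate the worst transition kernel by resampling:
    draw several candidate next states from the nominal kernel and
    keep the one with the lowest estimated value, so the non-robust
    learner trains against adversarial (worst-case) transitions."""
    candidates = [nominal_step(state, action, rng) for _ in range(n_candidates)]
    return min(candidates, key=value_estimate)

rng = random.Random(0)
s_next = worst_kernel_step(0.0, 1.0, rng)
```

Because the worst-case selection happens entirely in the environment interaction, any off-the-shelf RL algorithm can consume the resulting transitions unchanged, which is the flexibility the abstract refers to.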
Related papers
- Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form [26.01796404477275]
This paper presents the first algorithm capable of identifying a near-optimal policy in a robust constrained MDP (RCMDP).
An optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments.
arXiv Detail & Related papers (2024-08-29T06:37:16Z)
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU are naturally modeled as Multistage Problems (MSPs) but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach, Two-Stage General Decision Rules (TS-GDR), to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-LDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z)
- Policy Gradient for Rectangular Robust Markov Decision Processes [62.397882389472564]
We introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (RMDPs).
Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent.
arXiv Detail & Related papers (2023-01-31T12:40:50Z)
- Policy Gradient in Robust MDPs with Global Convergence Guarantee [13.40471012593073]
Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors.
This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs.
In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy.
arXiv Detail & Related papers (2022-12-20T17:14:14Z)
- Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
- Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z)
- Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z)
- Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty [9.246374019271935]
We propose to merge the theory of constrained Markov decision processes (CMDPs) with the theory of robust Markov decision processes (RMDPs).
This formulation allows us to design RL algorithms that are robust in performance, and provides constraint satisfaction guarantees.
We first propose the general problem formulation under the concept of RCMDP, and then propose a Lagrangian formulation of the optimization problem, leading to a robust-constrained policy gradient RL algorithm.
arXiv Detail & Related papers (2020-10-10T01:53:37Z)
- Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Process (RMDP) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.