Safe Exploration by Solving Early Terminated MDP
- URL: http://arxiv.org/abs/2107.04200v1
- Date: Fri, 9 Jul 2021 04:24:40 GMT
- Title: Safe Exploration by Solving Early Terminated MDP
- Authors: Hao Sun, Ziping Xu, Meng Fang, Zhenghao Peng, Jiadong Guo, Bo Dai,
Bolei Zhou
- Abstract summary: We introduce a new approach to address safe RL problems under the framework of Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
- Score: 77.10563395197045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safe exploration is crucial for the real-world application of reinforcement
learning (RL). Previous works consider the safe exploration problem as a
Constrained Markov Decision Process (CMDP), where policies are optimized
under constraints. However, when encountering any potential danger, humans
tend to stop immediately and rarely learn to behave safely while in danger.
Motivated by human learning, we introduce a new approach to address safe RL
problems under the framework of Early Terminated MDP (ET-MDP). We first define
the ET-MDP as an unconstrained MDP with the same optimal value function as its
corresponding CMDP. An off-policy algorithm based on context models is then
proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with
better asymptotic performance and improved learning efficiency. Experiments on
various CMDP tasks show a substantial improvement over previous methods that
directly solve CMDP.
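As a concrete illustration of the construction (a minimal sketch, not the authors' implementation): a CMDP maximizes the expected return $\mathbb{E}_\pi[\sum_t \gamma^t r_t]$ subject to a constraint such as $\mathbb{E}_\pi[\sum_t \gamma^t c_t] \le d$, while the ET-MDP drops the constraint and instead ends the episode as soon as the accumulated cost exceeds the budget. The Gym-style wrapper below assumes the classic 4-tuple step API, a per-step `cost` entry in `info`, and an optional terminal penalty; these are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of the early-termination idea (illustrative only; not the
# authors' code). Assumes a Safety-Gym-style environment that reports a
# per-step constraint cost in info["cost"] and the classic 4-tuple Gym API.
import gym


class EarlyTerminationWrapper(gym.Wrapper):
    """Turn a CMDP-style environment (reward + per-step cost) into an
    unconstrained MDP by ending the episode once the cost budget is exceeded."""

    def __init__(self, env, cost_limit=0.0, terminal_penalty=0.0):
        super().__init__(env)
        self.cost_limit = cost_limit              # allowed cumulative cost (assumption)
        self.terminal_penalty = terminal_penalty  # optional penalty on violation (assumption)
        self._cum_cost = 0.0

    def reset(self, **kwargs):
        self._cum_cost = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._cum_cost += info.get("cost", 0.0)   # accumulate constraint cost
        if self._cum_cost > self.cost_limit:
            done = True                           # early termination: end the episode
            reward -= self.terminal_penalty       # optional shaping, not from the paper
            info["early_terminated"] = True
        return obs, reward, done, info
```

Any standard unconstrained off-policy learner can then be trained on the wrapped environment as if it were an ordinary MDP; the context-model-based off-policy algorithm the paper proposes for solving the ET-MDP is not reproduced in this sketch.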
Related papers
- Solving Multi-Model MDPs by Coordinate Ascent and Dynamic Programming [8.495921422521068]
Multi-model Markov decision process (MMDP) is a promising framework for computing policies under model uncertainty.
MMDPs aim to find a policy that maximizes the expected return over a distribution of MDP models.
We propose CADP, which combines a coordinate ascent method and a dynamic programming algorithm for solving MMDPs.
arXiv Detail & Related papers (2024-07-08T18:47:59Z)
- Robust Average-Reward Markov Decision Processes [25.125481838479256]
We focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set.
We take an approach that approximates average-reward MDPs using discounted MDPs.
We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value algorithm that provably finds its solution.
arXiv Detail & Related papers (2023-01-02T19:51:55Z)
- Optimality Guarantees for Particle Belief Approximation of POMDPs [55.83001584645448]
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems.
POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid.
We propose a theory characterizing the approximation error of the particle filtering techniques used by sampling-based POMDP solvers.
arXiv Detail & Related papers (2022-10-10T21:11:55Z)
- Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ.
We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments.
We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
arXiv Detail & Related papers (2022-03-17T14:51:21Z)
- Twice regularized MDPs and the equivalence between robustness and regularization [65.58188361659073]
We show that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs.
We generalize regularized MDPs to twice regularized MDPs.
arXiv Detail & Related papers (2021-10-12T18:33:45Z)
- RL for Latent MDPs: Regret Guarantees and a Lower Bound [74.41782017817808]
We consider the regret problem for reinforcement learning in latent Markov Decision Processes (LMDP)
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
arXiv Detail & Related papers (2021-02-09T16:49:58Z)
- Exploration-Exploitation in Constrained MDPs [79.23623305214275]
We investigate the exploration-exploitation dilemma in Constrained Markov Decision Processes (CMDPs).
While learning in an unknown CMDP, an agent must trade off exploration, to discover new information about the MDP, against exploitation of its current knowledge to maximize reward.
While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process.
arXiv Detail & Related papers (2020-03-04T17:03:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.