C-MCTS: Safe Planning with Monte Carlo Tree Search
- URL: http://arxiv.org/abs/2305.16209v4
- Date: Sun, 27 Oct 2024 08:11:16 GMT
- Title: C-MCTS: Safe Planning with Monte Carlo Tree Search
- Authors: Dinesh Parthasarathy, Georgios Kontes, Axel Plinge, Christopher Mutschler
- Abstract summary: The Constrained Markov Decision Process (CMDP) formulation makes it possible to solve safety-critical decision-making tasks that are subject to constraints.
We propose Constrained MCTS (C-MCTS), which estimates cost using a safety critic that is trained with Temporal Difference learning in an offline phase prior to agent deployment.
C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards than previous work.
- Score: 2.8445375187526154
- Abstract: The Constrained Markov Decision Process (CMDP) formulation makes it possible to solve safety-critical decision-making tasks that are subject to constraints. While CMDPs have been extensively studied in the Reinforcement Learning literature, little attention has been given to sampling-based planning algorithms such as MCTS for solving them. Previous approaches perform conservatively with respect to costs as they avoid constraint violations by using Monte Carlo cost estimates that suffer from high variance. We propose Constrained MCTS (C-MCTS), which estimates cost using a safety critic that is trained with Temporal Difference learning in an offline phase prior to agent deployment. The critic limits exploration by pruning unsafe trajectories within MCTS during deployment. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards than previous work. As a nice byproduct, the planner is more efficient w.r.t. planning steps. Most importantly, under model mismatch between the planner and the real world, C-MCTS is less susceptible to cost violations than previous work.
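To make the idea concrete, here is a minimal sketch of one planning step in which a pre-trained safety critic prunes actions whose estimated cost exceeds the remaining budget, and standard UCT search runs over the remaining actions. The environment interface (`env.actions`, `env.step`, `env.reward`, `env.is_terminal`) and `cost_critic` are hypothetical placeholders; C-MCTS additionally trains the critic with TD learning offline and reasons about cumulative trajectory costs, so this is an illustration of the pruning mechanism rather than the authors' implementation.

```python
# Sketch only: hypothetical env/critic interfaces, not the authors' code.
import math
import random


class Node:
    """One state node in the search tree."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.total_reward = 0.0


def safe_actions(env, state, cost_critic, cost_budget):
    """Keep only actions whose critic-estimated cost stays within the budget."""
    return [a for a in env.actions(state) if cost_critic(state, a) <= cost_budget]


def uct_select(node, c=1.4):
    """Standard UCT selection over already-expanded children."""
    def score(child):
        exploit = child.total_reward / max(child.visits, 1)
        explore = c * math.sqrt(math.log(node.visits + 1) / max(child.visits, 1))
        return exploit + explore
    return max(node.children.items(), key=lambda kv: score(kv[1]))


def c_mcts_step(env, root_state, cost_critic, cost_budget, n_sims=200, horizon=20):
    """One planning step: build a tree whose expansion is gated by the critic."""
    root = Node(root_state)
    for _ in range(n_sims):
        node, state, depth = root, root_state, 0
        # Selection / expansion, restricted to critic-approved actions.
        while depth < horizon and not env.is_terminal(state):
            actions = safe_actions(env, state, cost_critic, cost_budget)
            if not actions:
                break
            untried = [a for a in actions if a not in node.children]
            if untried:
                a = random.choice(untried)
                state = env.step(state, a)
                node.children[a] = Node(state, parent=node)
                node = node.children[a]
                depth += 1
                break
            a, node = uct_select(node)
            state = node.state
            depth += 1
        # Random rollout (still filtered by the critic) to estimate the return.
        ret = 0.0
        while depth < horizon and not env.is_terminal(state):
            actions = safe_actions(env, state, cost_critic, cost_budget)
            if not actions:
                break
            a = random.choice(actions)
            state = env.step(state, a)
            ret += env.reward(state)
            depth += 1
        # Backpropagation.
        while node is not None:
            node.visits += 1
            node.total_reward += ret
            node = node.parent
    if not root.children:
        raise RuntimeError("no action at the root passed the safety critic")
    # Act greedily with respect to visit counts among the safe children.
    return max(root.children, key=lambda a: root.children[a].visits)
```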
Related papers
- Threshold UCT: Cost-Constrained Monte Carlo Tree Search with Pareto Curves [1.799933345199395]
Constrained Markov decision processes (CMDPs) are the leading framework for safe sequential decision making under uncertainty.
We introduce Threshold UCT, an online MCTS-based algorithm for CMDP planning.
Our experiments demonstrate that our approach significantly outperforms state-of-the-art methods from the literature.
arXiv Detail & Related papers (2024-12-18T15:41:47Z)
- Monte Carlo Planning for Stochastic Control on Constrained Markov Decision Processes [1.445706856497821]
This work defines an MDP framework, the SD-MDP, where we disentangle the causal structure of MDPs' transition and reward dynamics.
We derive theoretical guarantees on the estimation error of the value function under an optimal policy by allowing independent value estimation from Monte Carlo sampling.
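The guarantee above concerns value estimates obtained from Monte Carlo sampling. A generic sketch of such an estimator follows; the `simulate` and `policy` callables are hypothetical, and this is not the SD-MDP construction itself.

```python
def mc_value_estimate(simulate, policy, state, n_samples=1000, horizon=50, gamma=0.99):
    """Estimate V(state) under `policy` by averaging sampled discounted returns.

    `simulate(state, action)` is a hypothetical one-step simulator returning
    (next_state, reward, done); the Monte Carlo estimation error shrinks at a
    rate of O(1/sqrt(n_samples)).
    """
    total = 0.0
    for _ in range(n_samples):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = simulate(s, a)
            ret += discount * r
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_samples
```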
arXiv Detail & Related papers (2024-06-23T16:22:40Z)
- ConstrainedZero: Chance-Constrained POMDP Planning using Learned Probabilistic Failure Surrogates and Adaptive Safety Constraints [34.9739641898452]
This work introduces the ConstrainedZero policy algorithm that solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy.
Results show that by separating safety constraints from the objective we can achieve a target level of safety without optimizing the balance between rewards and costs.
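One way to read "separating safety constraints from the objective" is the schematic action filter below: a learned failure-probability surrogate gates the candidate actions, and value is maximized only among those that meet a target safety level delta. The handles `value_net` and `failure_net` are hypothetical, and this is not the ConstrainedZero belief-space search itself.

```python
def select_action(actions, belief, value_net, failure_net, delta=0.05):
    """Pick the best-value action among those meeting a target failure level.

    `value_net(belief, a)` and `failure_net(belief, a)` stand in for learned
    approximations of the value and of the probability of constraint violation.
    """
    admissible = [a for a in actions if failure_net(belief, a) <= delta]
    if not admissible:
        # Fall back to the least risky action when nothing meets the target.
        return min(actions, key=lambda a: failure_net(belief, a))
    return max(admissible, key=lambda a: value_net(belief, a))
```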
arXiv Detail & Related papers (2024-05-01T17:17:22Z)
- Anytime-Competitive Reinforcement Learning with Policy Prior [41.45104303955067]
The objective of A-CMDP is to optimize the expected reward while guaranteeing a bounded cost in each round of any episode against a policy prior.
We propose a new algorithm, called Anytime-Competitive Reinforcement Learning (ACRL), which provably guarantees the anytime cost constraints.
arXiv Detail & Related papers (2023-11-02T19:44:59Z)
- Constrained Hierarchical Monte Carlo Belief-State Planning [35.606121916832144]
We introduce Constrained Options Belief Tree Search (COBeTS) to scale online search-based CPOMDP planning to large robotic problems.
If primitive option controllers are defined to satisfy assigned constraint budgets, COBeTS will satisfy constraints anytime.
We demonstrate COBeTS in several safety-critical, constrained partially observable robotic domains.
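A rough sketch of the budget-decomposition idea behind the anytime guarantee: the high-level loop only ever runs an option whose low-level controller can certify a worst-case cost within the budget still available, so cumulative cost stays bounded at every point of execution. All names (`worst_case_cost`, `plan_option`, `execute`) are hypothetical stand-ins, not the COBeTS procedure.

```python
def admissible_options(options, remaining_budget):
    """Options whose controller certifies a worst-case cost within the budget."""
    return [o for o in options if o.worst_case_cost() <= remaining_budget]


def run_episode(env, plan_option, options, budget):
    """Execute options one at a time, shrinking the budget as cost is incurred.

    If every executed option respects the budget handed to it, the cumulative
    cost never exceeds `budget` at any point of the episode ("anytime" safety).
    """
    state, spent = env.reset(), 0.0
    while not env.done(state):
        candidates = admissible_options(options, budget - spent)
        if not candidates:
            break                                   # nothing can be run safely
        option = plan_option(state, candidates)     # e.g. a belief-tree search
        state, cost = option.execute(env, state)    # controller bounds its cost
        spent += cost
    return state, spent
```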
arXiv Detail & Related papers (2023-10-30T22:16:53Z)
- Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks [63.189632935619535]
Bayes risk CTC (BRCTC) is proposed to enforce the desired characteristics of the predicted alignment.
By using BRCTC with a preference for early emissions, we obtain an improved performance-latency trade-off for online models.
arXiv Detail & Related papers (2022-10-14T03:55:36Z)
- Continuous Monte Carlo Graph Search [61.11769232283621]
Continuous Monte Carlo Graph Search (CMCGS) is an extension of Monte Carlo Tree Search (MCTS) to online planning.
CMCGS takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance.
It can be scaled up through parallelization, and it outperforms the Cross-Entropy Method (CEM) in continuous control with learned dynamics models.
arXiv Detail & Related papers (2022-10-04T07:34:06Z)
- A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
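The two quantities balanced in this setting are typically the regret against the best fixed feasible decision and the cumulative constraint violation over the $T$ rounds; the standard definitions are sketched below (notation assumed here, not taken from the paper).

```latex
% Regret against the best fixed decision satisfying the long-term constraints,
% and cumulative violation of the m constraints over T rounds (standard forms).
R_T = \max_{x \in \mathcal{X}\,:\,\sum_{t=1}^{T} g_{t,i}(x) \le 0 \;\forall i}
      \sum_{t=1}^{T} f_t(x) \;-\; \sum_{t=1}^{T} f_t(x_t),
\qquad
V_T = \max_{i \in [m]} \Bigl[\, \sum_{t=1}^{T} g_{t,i}(x_t) \Bigr]_{+}
```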
arXiv Detail & Related papers (2022-09-15T16:59:19Z)
- Exploration-Exploitation in Constrained MDPs [79.23623305214275]
We investigate the exploration-exploitation dilemma in Constrained Markov Decision Processes (CMDPs).
While learning in an unknown CMDP, an agent should trade off exploration, to discover new information about the MDP, against exploitation of its current knowledge to maximize the reward.
While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process.
arXiv Detail & Related papers (2020-03-04T17:03:56Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
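Primal-dual algorithms of this family build on the standard CMDP Lagrangian, sketched below for a single cost constraint (a textbook formulation, not the paper's specific update rule): the policy is the primal variable and the Lagrange multiplier the dual one.

```latex
% Standard CMDP Lagrangian with a single cost constraint V_c(\pi) \le \bar{c}:
% the primal player updates the policy \pi, the dual player the multiplier \lambda.
\min_{\lambda \ge 0} \; \max_{\pi} \;
L(\pi, \lambda) = V_r(\pi) - \lambda \bigl( V_c(\pi) - \bar{c} \bigr)
```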
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.