Monte Carlo Information-Oriented Planning
- URL: http://arxiv.org/abs/2103.11345v1
- Date: Sun, 21 Mar 2021 09:09:27 GMT
- Title: Monte Carlo Information-Oriented Planning
- Authors: Vincent Thomas, G\'er\'emy Hutin, Olivier Buffet
- Abstract summary: We discuss how to solve information-gathering problems expressed as rho-POMDPs.
We build on the POMCP algorithm to propose a Monte Carlo Tree Search for rho-POMDPs.
- Score: 6.0158981171030685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this article, we discuss how to solve information-gathering problems
expressed as rho-POMDPs, an extension of Partially Observable Markov Decision
Processes (POMDPs) whose reward rho depends on the belief state. Point-based
approaches used for solving POMDPs have been extended to solving rho-POMDPs as
belief MDPs when its reward rho is convex in B or when it is
Lipschitz-continuous. In the present paper, we build on the POMCP algorithm to
propose a Monte Carlo Tree Search for rho-POMDPs, aiming for an efficient
on-line planner which can be used for any rho function. Adaptations are
required due to the belief-dependent rewards to (i) propagate more than one
state at a time, and (ii) prevent biases in value estimates. An asymptotic
convergence proof to epsilon-optimal values is given when rho is continuous.
Experiments are conducted to analyze the algorithms at hand and show that they
outperform myopic approaches.
Related papers
- Offline Bayesian Aleatoric and Epistemic Uncertainty Quantification and Posterior Value Optimisation in Finite-State MDPs [3.1139806580181006]
We address the challenge of quantifying Bayesian uncertainty in offline use cases of finite-state Markov Decision Processes (MDPs) with unknown dynamics.
We use standard Bayesian reinforcement learning methods to capture the posterior uncertainty in MDP parameters.
We then analytically compute the first two moments of the return distribution across posterior samples and apply the law of total variance.
We highlight the real-world impact and computational scalability of our method by applying it to the AI Clinician problem.
arXiv Detail & Related papers (2024-06-04T16:21:14Z) - Power Mean Estimation in Stochastic Monte-Carlo Tree_Search [25.058008522872747]
Monte-Carlo Tree Search (MCTS) is a widely-used strategy for online planning that combines Monte-Carlo sampling with forward tree search.
The theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection.
This paper introduces an algorithm using the power mean estimator and tailored for MDPs.
arXiv Detail & Related papers (2024-06-04T11:56:37Z) - On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs)
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $Oleft(frac1Tright),$ which translates to $Oleft(log(T)right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Rollout Heuristics for Online Stochastic Contingent Planning [6.185979230964809]
Partially Observable Monte-Carlo Planning is an online algorithm for deciding on the next action to perform.
POMDP is highly dependent on the rollout policy to compute good estimates.
In this paper, we model POMDPs as contingent planning problems.
arXiv Detail & Related papers (2023-10-03T18:24:47Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State
Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs)
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - B$^3$RTDP: A Belief Branch and Bound Real-Time Dynamic Programming
Approach to Solving POMDPs [17.956744635160568]
We propose an extension to the RTDP-Bel algorithm which we call Belief Branch and Bound RTDP (B$3$RTDP)
Our algorithm uses a bounded value function representation and takes advantage of this in two novel ways.
We empirically demonstrate that B$3$RTDP can achieve greater returns in less time than the state-of-the-art SARSOP solver on known POMDP problems.
arXiv Detail & Related papers (2022-10-22T21:42:59Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Rule-based Shielding for Partially Observable Monte-Carlo Planning [78.05638156687343]
We propose two contributions to Partially Observable Monte-Carlo Planning (POMCP)
The first is a method for identifying unexpected actions selected by POMCP with respect to expert prior knowledge of the task.
The second is a shielding approach that prevents POMCP from selecting unexpected actions.
We evaluate our approach on Tiger, a standard benchmark for POMDPs, and a real-world problem related to velocity regulation in mobile robot navigation.
arXiv Detail & Related papers (2021-04-28T14:23:38Z) - Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs)
Some variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
arXiv Detail & Related papers (2021-02-27T19:28:39Z) - A maximum-entropy approach to off-policy evaluation in average-reward
MDPs [54.967872716145656]
This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs)
We provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases.
We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning.
arXiv Detail & Related papers (2020-06-17T18:13:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.