Monte Carlo Information-Oriented Planning
- URL: http://arxiv.org/abs/2103.11345v1
- Date: Sun, 21 Mar 2021 09:09:27 GMT
- Title: Monte Carlo Information-Oriented Planning
- Authors: Vincent Thomas, Gérémy Hutin, Olivier Buffet
- Abstract summary: We discuss how to solve information-gathering problems expressed as rho-POMDPs.
We build on the POMCP algorithm to propose a Monte Carlo Tree Search for rho-POMDPs.
- Score: 6.0158981171030685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this article, we discuss how to solve information-gathering problems
expressed as rho-POMDPs, an extension of Partially Observable Markov Decision
Processes (POMDPs) whose reward rho depends on the belief state. Point-based
approaches used for solving POMDPs have been extended to solve rho-POMDPs as
belief MDPs when the reward rho is convex over the belief space B or when it is
Lipschitz-continuous. In the present paper, we build on the POMCP algorithm to
propose a Monte Carlo Tree Search for rho-POMDPs, aiming for an efficient
on-line planner which can be used for any rho function. Adaptations are
required due to the belief-dependent rewards to (i) propagate more than one
state at a time, and (ii) prevent biases in value estimates. An asymptotic
convergence proof to epsilon-optimal values is given when rho is continuous.
Experiments are conducted to analyze the algorithms at hand and show that they
outperform myopic approaches.
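To make adaptations (i) and (ii) concrete, here is a minimal sketch of what a rho-POMCP-style simulation step can look like when the belief-dependent reward is the negative belief entropy. All names (`Node`, `simulate`, `step_fn`, the constants) are illustrative assumptions; this is a sketch under those assumptions, not the authors' implementation.

```python
"""Minimal sketch of a rho-POMCP-style simulation step (illustrative only)."""
import math
import random
from collections import Counter, defaultdict

GAMMA = 0.95        # discount factor (assumed)
C_UCB = 1.0         # UCB exploration constant (assumed)
MAX_DEPTH = 20      # search horizon (assumed)

class Node:
    """A history node of the search tree, carrying a particle-set belief."""
    def __init__(self):
        self.visits = 0
        self.value = 0.0
        self.particles = []      # hashable states collected at this history
        self.children = {}       # action (or observation) -> Node

def rho(particles):
    """Belief-dependent reward: negative entropy of the empirical belief."""
    counts = Counter(particles)
    n = len(particles)
    return sum((c / n) * math.log(c / n) for c in counts.values())

def simulate(node, particles, depth, actions, step_fn):
    """One MCTS descent. Adaptation (i): a whole bag of particles is pushed
    through the generative model step_fn(s, a) -> (s', o). Adaptation (ii):
    rho is evaluated on that bag rather than on a single sampled state, so the
    belief-dependent reward is not estimated from a degenerate belief."""
    if depth >= MAX_DEPTH or not particles:
        return 0.0
    node.particles.extend(particles)

    def ucb(a):
        child = node.children.setdefault(a, Node())
        if child.visits == 0:
            return float("inf")
        return child.value + C_UCB * math.sqrt(
            math.log(node.visits + 1) / child.visits)

    action = max(actions, key=ucb)
    child = node.children[action]

    # Propagate every particle and group the successors by observation.
    by_obs = defaultdict(list)
    for s in particles:
        s2, o = step_fn(s, action)
        by_obs[o].append(s2)
    # Sample one observation branch in proportion to its particle mass.
    obs = random.choices(list(by_obs),
                         weights=[len(v) for v in by_obs.values()])[0]
    next_particles = by_obs[obs]

    reward = rho(next_particles)
    grandchild = child.children.setdefault(obs, Node())
    ret = reward + GAMMA * simulate(grandchild, next_particles, depth + 1,
                                    actions, step_fn)
    node.visits += 1
    child.visits += 1
    child.value += (ret - child.value) / child.visits
    return ret
```

A planner built around such a step would call `simulate` repeatedly from a root node seeded with particles drawn from the initial belief and then execute the most-visited root action, mirroring the usual POMCP control loop.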
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning [12.490614705930676]
We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation.
Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching.
We show that it is possible to provide tighter guarantees for the suboptimality of MOMBO than the existing Monte Carlo sampling approaches.
arXiv Detail & Related papers (2024-06-06T13:58:41Z) - Offline Bayesian Aleatoric and Epistemic Uncertainty Quantification and Posterior Value Optimisation in Finite-State MDPs [3.1139806580181006]
We address the challenge of quantifying Bayesian uncertainty in offline use cases of finite-state Markov Decision Processes (MDPs) with unknown dynamics.
We use standard Bayesian reinforcement learning methods to capture the posterior uncertainty in MDP parameters.
We then analytically compute the first two moments of the return distribution across posterior samples and apply the law of total variance.
We highlight the real-world impact and computational scalability of our method by applying it to the AI Clinician problem.
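For reference, the decomposition that summary points to is the standard law of total variance applied to the return $G$ under posterior uncertainty over the MDP parameters $\theta$; this is a textbook identity stated as a reading aid, not a formula specific to that paper.

```latex
% Law of total variance for the return G, with \theta the MDP parameters drawn
% from the posterior (aleatoric + epistemic split); standard identity only.
\operatorname{Var}[G]
  = \underbrace{\mathbb{E}_{\theta}\big[\operatorname{Var}[G \mid \theta]\big]}_{\text{aleatoric}}
  + \underbrace{\operatorname{Var}_{\theta}\big(\mathbb{E}[G \mid \theta]\big)}_{\text{epistemic}}
```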
arXiv Detail & Related papers (2024-06-04T16:21:14Z) - On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average-reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Rollout Heuristics for Online Stochastic Contingent Planning [6.185979230964809]
Partially Observable Monte-Carlo Planning is an online algorithm for deciding on the next action to perform.
POMCP is highly dependent on the rollout policy to compute good value estimates.
In this paper, we model POMDPs as contingent planning problems.
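As a reading aid for that summary, the sketch below shows the kind of rollout estimate POMCP leans on; `policy`, `step_reward_fn` and the default uniform-random choice are illustrative assumptions, not the paper's heuristics.

```python
# Minimal sketch of the rollout value estimate POMCP relies on (hypothetical
# names and signatures, for illustration only).
import random

def rollout(state, depth, actions, step_reward_fn, policy=None,
            gamma=0.95, max_depth=20):
    """Estimate the value of `state` by simulating the rollout policy.

    `step_reward_fn(s, a)` is assumed to return (next_state, reward);
    `policy` defaults to uniform random, the weak baseline that informed
    rollout heuristics aim to improve on.
    """
    if depth >= max_depth:
        return 0.0
    a = policy(state) if policy else random.choice(actions)
    next_state, reward = step_reward_fn(state, a)
    return reward + gamma * rollout(next_state, depth + 1, actions,
                                    step_reward_fn, policy, gamma, max_depth)
```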
arXiv Detail & Related papers (2023-10-03T18:24:47Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - B$^3$RTDP: A Belief Branch and Bound Real-Time Dynamic Programming
Approach to Solving POMDPs [17.956744635160568]
We propose an extension to the RTDP-Bel algorithm which we call Belief Branch and Bound RTDP (B$^3$RTDP).
Our algorithm uses a bounded value function representation and takes advantage of this in two novel ways.
We empirically demonstrate that B$3$RTDP can achieve greater returns in less time than the state-of-the-art SARSOP solver on known POMDP problems.
arXiv Detail & Related papers (2022-10-22T21:42:59Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs).
A variant of Stochastic Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
arXiv Detail & Related papers (2021-02-27T19:28:39Z) - A maximum-entropy approach to off-policy evaluation in average-reward MDPs [54.967872716145656]
This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs).
We provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases.
We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning.
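The exponential-family form mentioned there, with the features $\phi(s,a)$ as sufficient statistics and natural parameter $\theta$, can be written generically as below; this states the generic family, not the paper's exact estimator.

```latex
% Generic exponential-family distribution over state-action pairs whose
% sufficient statistics are the features \phi(s,a); illustrative form only.
p_{\theta}(s, a) \;\propto\; \exp\!\big( \theta^{\top} \phi(s, a) \big)
```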
arXiv Detail & Related papers (2020-06-17T18:13:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.