Simplified Belief-Dependent Reward MCTS Planning with Guaranteed Tree Consistency
- URL: http://arxiv.org/abs/2105.14239v1
- Date: Sat, 29 May 2021 07:25:11 GMT
- Title: Simplified Belief-Dependent Reward MCTS Planning with Guaranteed Tree Consistency
- Authors: Ori Sztyglic, Andrey Zhitnikov, Vadim Indelman
- Abstract summary: Partially Observable Markov Decision Processes (POMDPs) are notoriously hard to solve.
Most advanced state-of-the-art online solvers leverage ideas of Monte Carlo Tree Search (MCTS)
We present a novel variant of the MCTS algorithm that considers information-theoretic rewards but avoids the need to calculate them completely.
- Score: 11.688030627514532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Partially Observable Markov Decision Processes (POMDPs) are notoriously hard
to solve. Most advanced state-of-the-art online solvers leverage ideas of Monte
Carlo Tree Search (MCTS). These solvers rapidly converge to the most promising
branches of the belief tree, avoiding the suboptimal sections. Most of these
algorithms are designed to utilize straightforward access to the state reward
and assume the belief-dependent reward is nothing but expectation over the
state reward. Thus, they are inapplicable to a more general and essential
setting of belief-dependent rewards. One example of such a reward is the differential entropy approximated using a set of weighted particles of the belief. Such an
information-theoretic reward introduces a significant computational burden. In
this paper, we embed the paradigm of simplification into the MCTS algorithm. In
particular, we present Simplified Information-Theoretic Particle Filter Tree
(SITH-PFT), a novel variant of the MCTS algorithm that considers
information-theoretic rewards but avoids the need to calculate them completely.
We replace the costly calculation of information-theoretic rewards with
adaptive upper and lower bounds. These bounds are easy to calculate and
tightened only by the demand of our algorithm. Crucially, we guarantee
precisely the same belief tree and solution that would be obtained by MCTS,
which explicitly calculates the original information-theoretic rewards. Our
approach is general; namely, any bounds that converge to the reward can be easily plugged in to achieve a substantial speedup without any loss in performance.
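To make the bound-based tree search concrete, here is a minimal, hedged sketch (not the authors' implementation) of the idea described above: each belief node keeps cheap lower and upper bounds on its information-theoretic reward, the UCB-style action choice is certified using those bounds, and the bounds are tightened (by using more of the belief's particles) only when they are too loose to single out the same action that exact reward computation would pick. The class and function names, the entropy proxy, and the slack term are illustrative assumptions rather than the paper's exact bounds; value backup and tree expansion are omitted.

```python
import math


class BeliefNode:
    """A belief-tree node holding a weighted particle set and adaptive reward bounds."""

    def __init__(self, particles, weights):
        self.particles = particles   # sampled states representing the belief
        self.weights = weights       # normalized importance weights
        self.children = {}           # action -> child BeliefNode
        self.visits = 0
        self.level = 0               # current simplification level (0 = cheapest)

    def reward_bounds(self):
        """Cheap lower/upper bounds on the information-theoretic reward.

        As a stand-in, we bound an entropy proxy computed from only a subset of
        particles chosen by the simplification level; the paper's actual bounds
        differ, but share the property that they tighten as the level grows and
        coincide with the exact reward at the final level.
        """
        n = len(self.particles)
        k = max(1, (n * (self.level + 1)) // 4)
        sub = self.weights[:k]
        z = sum(sub) or 1.0
        h = -sum((w / z) * math.log(w / z + 1e-12) for w in sub)
        slack = math.log(n) * (1.0 - k / n)   # zero once all particles are used
        return h - slack, h + slack

    def tighten(self):
        """Raise the simplification level, i.e. use more particles next time."""
        self.level = min(self.level + 1, 3)


def ucb_interval(parent, child, c=1.0):
    """Interval of possible UCB scores for a child, induced by its reward bounds."""
    lo, hi = child.reward_bounds()
    bonus = c * math.sqrt(math.log(parent.visits + 1) / (child.visits + 1))
    return lo + bonus, hi + bonus


def select_action(node):
    """Pick the UCB-maximizing action, tightening bounds only on demand."""
    while True:
        intervals = {a: ucb_interval(node, ch) for a, ch in node.children.items()}
        best = max(intervals, key=lambda a: intervals[a][0])
        best_lo = intervals[best][0]
        contenders = [a for a, (lo, hi) in intervals.items()
                      if a != best and hi > best_lo]
        if not contenders:
            return best
        # Bounds too loose to certify the winner: tighten only the contenders.
        node.children[best].tighten()
        for a in contenders:
            node.children[a].tighten()
```

The property the sketch preserves is that an action is committed to only once its lower score dominates every other action's upper score, which is what allows the same belief tree and solution as exact-reward MCTS.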
Related papers
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- No Compromise in Solution Quality: Speeding Up Belief-dependent Continuous POMDPs via Adaptive Multilevel Simplification [6.300736240833814]
Continuous POMDPs with general belief-dependent rewards are notoriously difficult to solve online.
We present a complete provable theory of adaptive multilevel simplification for the setting of a given externally constructed belief tree.
We present three algorithms to accelerate continuous POMDP online planning with belief-dependent rewards.
arXiv Detail & Related papers (2023-10-16T10:59:22Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Towards Theoretical Understanding of Inverse Reinforcement Learning [45.3190496371625]
Inverse reinforcement learning (IRL) denotes a powerful family of algorithms for recovering a reward function justifying the behavior demonstrated by an expert agent.
In this paper, we make a step towards closing the theory gap of IRL in the case of finite-horizon problems with a generative model.
arXiv Detail & Related papers (2023-04-25T16:21:10Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward (a minimal sketch of the redistribution step appears after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Under-Approximating Expected Total Rewards in POMDPs [68.8204255655161]
We consider the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP)
We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well.
arXiv Detail & Related papers (2022-01-21T16:43:03Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- Online POMDP Planning via Simplification [10.508187462682306]
We develop a novel approach to POMDP planning considering belief-dependent rewards.
Our approach is guaranteed to find the optimal solution of the original problem but with substantial speedup.
We validate our approach in simulation using these bounds, where simplification corresponds to reducing the number of samples, and exhibit a significant computational speedup.
arXiv Detail & Related papers (2021-05-11T18:46:08Z)
- Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback [32.62857394584907]
We study the multi-armed bandit (MAB) problem with composite and anonymous feedback.
We propose adaptive algorithms for both the adversarial and non-adversarial cases.
arXiv Detail & Related papers (2020-12-13T12:25:41Z)
- On Effective Parallelization of Monte Carlo Tree Search [51.15940034629022]
Monte Carlo Tree Search (MCTS) is computationally expensive as it requires a substantial number of rollouts to construct the search tree.
How to design effective parallel MCTS algorithms has not been systematically studied and remains poorly understood.
We demonstrate how proposed necessary conditions can be adopted to design more effective parallel MCTS algorithms.
arXiv Detail & Related papers (2020-06-15T21:36:00Z)
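As flagged in the PARTED entry above, here is a minimal, hedged sketch of least-squares reward redistribution (not the authors' implementation; the linear feature model and all names are illustrative assumptions): only trajectory-wise returns are observed, and a per-step proxy reward is fitted so that the proxies sum to each trajectory's return.

```python
import numpy as np


def redistribute_rewards(trajectories, returns):
    """Fit a linear proxy reward r_hat(s, a) = phi(s, a) @ w by least squares.

    trajectories: list of arrays of shape (T_i, d), per-step features phi(s_t, a_t).
    returns:      array of shape (n,), observed trajectory-wise returns.
    The fitted w makes the per-step proxies approximately sum to each return:
    sum_t phi_t @ w ~= R_i.
    """
    # Each trajectory contributes one linear equation: (sum_t phi_t) @ w = R_i.
    summed = np.stack([traj.sum(axis=0) for traj in trajectories])   # (n, d)
    w, *_ = np.linalg.lstsq(summed, np.asarray(returns), rcond=None)
    return w


# Toy usage with random 4-dimensional step features and a known linear reward.
rng = np.random.default_rng(0)
trajs = [rng.normal(size=(T, 4)) for T in (5, 8, 6, 7, 9)]
true_w = np.array([1.0, -0.5, 0.2, 0.0])
rets = np.array([traj.sum(axis=0) @ true_w for traj in trajs])
w_hat = redistribute_rewards(trajs, rets)
proxy_rewards = [traj @ w_hat for traj in trajs]   # per-step proxy rewards
```

PARTED then performs pessimistic value iteration on the learned proxy rewards; that step is omitted here.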
This list is automatically generated from the titles and abstracts of the papers in this site.