POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with
Non-Asymptotic Analysis
- URL: http://arxiv.org/abs/2006.04672v2
- Date: Wed, 30 Dec 2020 05:21:05 GMT
- Title: POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with
Non-Asymptotic Analysis
- Authors: Weichao Mao, Kaiqing Zhang, Qiaomin Xie, Tamer Başar
- Abstract summary: We consider Monte-Carlo planning in an environment with continuous state-action spaces.
We introduce POLY-HOOT, an algorithm that augments Monte-Carlo planning with a continuous armed bandit strategy.
We investigate, for the first time, the regret of the enhanced HOO algorithm in non-stationary bandit problems.
- Score: 24.373900721120286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has
demonstrated remarkable performance in applications with finite spaces. In this
paper, we consider Monte-Carlo planning in an environment with continuous
state-action spaces, a much less understood problem with important applications
in control and robotics. We introduce POLY-HOOT, an algorithm that augments
MCTS with a continuous armed bandit strategy named Hierarchical Optimistic
Optimization (HOO) (Bubeck et al., 2011). Specifically, we enhance HOO by using
an appropriate polynomial, rather than logarithmic, bonus term in the upper
confidence bounds. Such a polynomial bonus is motivated by its empirical
successes in AlphaGo Zero (Silver et al., 2017b), as well as its significant
role in achieving theoretical guarantees of finite space MCTS (Shah et al.,
2019). We investigate, for the first time, the regret of the enhanced HOO
algorithm in non-stationary bandit problems. Using this result as a building
block, we establish non-asymptotic convergence guarantees for POLY-HOOT: the
value estimate converges to an arbitrarily small neighborhood of the optimal
value function at a polynomial rate. We further provide experimental results
that corroborate our theoretical findings.
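To make the polynomial-bonus idea concrete, here is a minimal sketch of a HOO-style node score in which the usual logarithmic confidence term is replaced by a polynomial one. The function names, the default exponents alpha and xi, and the smoothness parameters nu1 and rho are illustrative placeholders; the admissible values come from the paper's non-asymptotic analysis, not from this sketch.

```python
import math


def poly_bonus(n_total: int, n_cell: int, alpha: float = 2.0, xi: float = 0.5) -> float:
    """Polynomial exploration bonus replacing sqrt(2 * log(n_total) / n_cell).

    alpha and xi are placeholder exponents; the paper's analysis prescribes
    the ranges for which convergence guarantees hold.
    """
    if n_cell == 0:
        return math.inf  # unvisited cells are maximally optimistic
    return (n_total ** alpha) / (n_cell ** xi)


def u_value(mean_return: float, n_total: int, n_cell: int,
            depth: int, nu1: float, rho: float) -> float:
    """HOO-style optimistic score of a cell of the continuous action space.

    mean_return : empirical mean of simulated returns routed through the cell
    nu1 * rho**depth bounds the variation of the value inside the cell
    (the standard HOO smoothness term).
    """
    return mean_return + poly_bonus(n_total, n_cell) + nu1 * (rho ** depth)
```

During each simulation the planner descends the hierarchical cover of the action space by picking children with the larger optimistic score; in HOO these U-values are further tightened into B-values before the descent, and the sketch only shows where the polynomial bonus enters.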
Related papers
- Power Mean Estimation in Stochastic Monte-Carlo Tree Search [25.058008522872747]
Monte-Carlo Tree Search (MCTS) is a widely-used strategy for online planning that combines Monte-Carlo sampling with forward tree search.
The theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection.
This paper introduces an algorithm that uses the power mean estimator and is tailored for MDPs.
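As a rough illustration of the power mean estimator named above (not the paper's exact backup rule), a visit-count-weighted power mean of child value estimates can be computed as follows; the exponent p is the estimator's parameter, and the value estimates are assumed non-negative.

```python
def power_mean_backup(child_values, child_visits, p: float) -> float:
    """Visit-weighted power mean of child value estimates.

    p = 1 recovers the ordinary weighted average used by UCT-style backups,
    and the estimate approaches max(child_values) as p grows.
    Assumes non-negative value estimates so that q ** p is well defined.
    """
    total = sum(child_visits)
    return sum((n / total) * q ** p
               for q, n in zip(child_values, child_visits)) ** (1.0 / p)
```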
arXiv Detail & Related papers (2024-06-04T11:56:37Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We directly sample the Q function from its posterior distribution by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
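A minimal sketch of the sampling idea above, assuming an unadjusted Langevin iteration over Q-function parameters; grad_log_post, the step size, and the iteration count are hypothetical placeholders rather than the cited paper's exact procedure.

```python
import numpy as np


def langevin_q_sample(theta, grad_log_post, n_steps: int = 100,
                      step_size: float = 1e-3, rng=None):
    """Approximate posterior sample of Q-function parameters via Langevin dynamics.

    Each step adds a gradient-ascent move on the log posterior plus Gaussian
    noise scaled by sqrt(step_size); acting greedily with respect to the
    sampled Q then gives Thompson-sampling-style exploration.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float).copy()
    for _ in range(n_steps):
        theta = (theta
                 + 0.5 * step_size * grad_log_post(theta)
                 + np.sqrt(step_size) * rng.standard_normal(theta.shape))
    return theta
```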
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Learning Logic Specifications for Soft Policy Guidance in POMCP [71.69251176275638]
Partially Observable Monte Carlo Planning (POMCP) is an efficient solver for Partially Observable Markov Decision Processes (POMDPs).
POMCP suffers from sparse reward functions, namely, rewards achieved only when the final goal is reached.
In this paper, we use inductive logic programming to learn logic specifications from traces of POMCP executions.
arXiv Detail & Related papers (2023-03-16T09:37:10Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes [3.9311044240639568]
Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model to maximize the expected return using gradient ascent.
While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues.
In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL. We then explore a combined policy approach of PG and MCTL to leverage their strengths.
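As a reminder of the gradient-ascent mechanic behind PG, here is a generic REINFORCE sketch (not the MCTL combination proposed in the paper); grad_log_pi is an assumed callable returning the score function of the parameterized policy.

```python
import numpy as np


def reinforce_gradient(episodes, grad_log_pi):
    """Monte-Carlo estimate of the policy gradient (REINFORCE).

    episodes    : list of trajectories, each a list of (state, action, reward)
    grad_log_pi : callable (state, action) -> d/dtheta log pi_theta(action | state)
    Uses the plain undiscounted return of each episode as the weight.
    """
    estimates = []
    for episode in episodes:
        episode_return = sum(reward for _, _, reward in episode)
        score = sum(grad_log_pi(state, action) for state, action, _ in episode)
        estimates.append(score * episode_return)
    return np.mean(estimates, axis=0)
```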
arXiv Detail & Related papers (2022-06-02T12:21:40Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble of GP priors is employed to yield an alternative scheme.
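The ensemble idea can be illustrated with a generic Bayesian-model-averaging combination of per-kernel GP predictions; the log-evidence weighting below is an assumed, simplified stand-in for the cited paper's OS-GPTD update, not its actual rule.

```python
import numpy as np


def ensemble_value_prediction(means, variances, log_evidences):
    """Combine per-GP value predictions at a query state by model averaging.

    means, variances : predictive mean / variance under each GP prior (kernel)
    log_evidences    : log marginal likelihood of the observed state-reward
                       sequence under each GP prior
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    log_evidences = np.asarray(log_evidences, dtype=float)
    weights = np.exp(log_evidences - np.max(log_evidences))
    weights /= weights.sum()
    mean = weights @ means
    # total variance = within-model uncertainty + between-model disagreement
    var = weights @ (variances + (means - mean) ** 2)
    return mean, var
```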
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Monte Carlo Information-Oriented Planning [6.0158981171030685]
We discuss how to solve information-gathering problems expressed as rho-POMDPs.
We build on the POMCP algorithm to propose a Monte Carlo Tree Search for rho-POMDPs.
arXiv Detail & Related papers (2021-03-21T09:09:27Z)
- Convex Regularization in Monte-Carlo Tree Search [41.11958980731047]
We introduce a unifying theory on the use of generic convex regularizers in Monte-Carlo Tree Search (MCTS).
We exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update.
We empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor.
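For intuition, the best-known instance of a convex-regularized backup is the maximum-entropy (log-sum-exp) operator sketched below; the cited paper's relative-entropy operators generalize this, and the temperature tau is an illustrative parameter.

```python
import numpy as np


def softmax_value_backup(q_values, tau: float = 1.0) -> float:
    """Entropy-regularized value backup: tau * log-sum-exp(q / tau).

    Recovers the hard max backup in the limit tau -> 0; computed with a max
    shift for numerical stability.
    """
    q = np.asarray(q_values, dtype=float)
    m = q.max()
    return float(m + tau * np.log(np.sum(np.exp((q - m) / tau))))
```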
arXiv Detail & Related papers (2020-07-01T11:29:08Z)
- Free Energy Wells and Overlap Gap Property in Sparse PCA [81.64027805404483]
We study a variant of the sparse PCA (principal component analysis) problem in the "hard" regime.
We show bounds on the depth of free energy wells for various Gibbs measures naturally associated to the problem.
We prove that the Overlap Gap Property (OGP) holds in a significant part of the hard regime.
arXiv Detail & Related papers (2020-06-18T17:18:02Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
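A generic primal-dual sketch of the constrained part (not the paper's exact algorithm, which additionally handles function approximation and safe exploration): the Lagrange multiplier of the cost constraint is updated by projected gradient ascent on the observed constraint violation. Names and step size are illustrative.

```python
def dual_multiplier_update(lmbda: float, measured_cost: float,
                           cost_budget: float, step_size: float = 0.01) -> float:
    """Projected gradient ascent on the Lagrange multiplier of a CMDP.

    The Lagrangian trades reward against constraint cost,
        L(pi, lmbda) = V_reward(pi) - lmbda * (V_cost(pi) - cost_budget),
    so the multiplier rises whenever the measured cost exceeds the budget
    and is projected back to lmbda >= 0 otherwise.
    """
    return max(0.0, lmbda + step_size * (measured_cost - cost_budget))
```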
arXiv Detail & Related papers (2020-03-01T17:47:03Z)