Monte-Carlo tree search with uncertainty propagation via optimal
transport
- URL: http://arxiv.org/abs/2309.10737v1
- Date: Tue, 19 Sep 2023 16:32:04 GMT
- Title: Monte-Carlo tree search with uncertainty propagation via optimal
transport
- Authors: Tuan Dam, Pascal Stenger, Lukas Schneider, Joni Pajarinen, Carlo
D'Eramo, Odalric-Ambrym Maillard
- Abstract summary: This paper introduces a novel backup strategy for Monte-Carlo Tree Search (MCTS).
We adopt a probabilistic approach, modeling both value and action-value nodes as Gaussian distributions.
We provide theoretical guarantees of convergence to the optimal policy, and an empirical evaluation on several stochastic and partially observable environments.
- Score: 27.931569422467835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel backup strategy for Monte-Carlo Tree Search
(MCTS) designed for highly stochastic and partially observable Markov decision
processes. We adopt a probabilistic approach, modeling both value and
action-value nodes as Gaussian distributions. We introduce a novel backup
operator that computes value nodes as the Wasserstein barycenter of their
action-value children nodes; thus, propagating the uncertainty of the estimate
across the tree to the root node. We study our novel backup operator when using
a novel combination of $L^1$-Wasserstein barycenter with $\alpha$-divergence,
by drawing a notable connection to the generalized mean backup operator. We
complement our probabilistic backup operator with two sampling strategies,
based on optimistic selection and Thompson sampling, obtaining our Wasserstein
MCTS algorithm. We provide theoretical guarantees of asymptotic convergence to
the optimal policy, and an empirical evaluation on several stochastic and
partially observable environments, where our approach outperforms well-known
related baselines.
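To make the backup operator concrete, here is a minimal, illustrative sketch in Python. It is not the paper's exact method: each value and action-value node is kept as a one-dimensional Gaussian, and a value node is backed up with the standard closed form of the $L^2$-Wasserstein barycenter of 1-D Gaussians (weighted average of the means and of the standard deviations), as a stand-in for the $L^1$-Wasserstein barycenter with $\alpha$-divergence studied in the paper. The GaussianNode class, the visit-count weights, and the two selection rules are hypothetical placeholders.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianNode:
    """A value or action-value estimate kept as a Gaussian N(mu, sigma^2)."""
    mu: float
    sigma: float

def barycenter_backup(children, weights):
    """Back up a value node from its Gaussian action-value children.

    Uses the closed form of the L2-Wasserstein barycenter of 1-D Gaussians
    (weighted mean of the means and of the standard deviations) as an
    illustrative stand-in for the paper's L1/alpha-divergence operator.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = float(np.dot(w, [c.mu for c in children]))
    sigma = float(np.dot(w, [c.sigma for c in children]))
    return GaussianNode(mu, sigma)

def thompson_select(children, rng):
    """Thompson sampling: draw one sample per child Gaussian, pick the argmax."""
    return int(np.argmax([rng.normal(c.mu, c.sigma) for c in children]))

def optimistic_select(children, beta=1.0):
    """Optimistic selection: pick the child with the largest mu + beta * sigma."""
    return int(np.argmax([c.mu + beta * c.sigma for c in children]))

# Toy usage with hypothetical action-value children weighted by visit counts.
rng = np.random.default_rng(0)
children = [GaussianNode(0.40, 0.20), GaussianNode(0.55, 0.05), GaussianNode(0.30, 0.30)]
visits = [12, 30, 4]
print(barycenter_backup(children, visits))
print(thompson_select(children, rng), optimistic_select(children))
```

In a full MCTS implementation the backup would typically use visit counts as weights and be applied recursively along the simulated path, which is how the uncertainty of leaf estimates propagates to the root.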
Related papers
- Power Mean Estimation in Stochastic Monte-Carlo Tree Search [25.058008522872747]
Monte-Carlo Tree Search (MCTS) is a widely-used strategy for online planning that combines Monte-Carlo sampling with forward tree search.
The theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection.
This paper introduces an algorithm that uses the power mean estimator and is tailored for stochastic MDPs; a minimal sketch of the power mean follows this entry.
arXiv Detail & Related papers (2024-06-04T11:56:37Z)
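The power mean estimator referenced above has a simple closed form, shown in the minimal sketch below; it assumes non-negative value estimates and visit-count weights, and the function name and example numbers are hypothetical rather than taken from that paper.

```python
import numpy as np

def power_mean(values, weights, p):
    """Weighted power (generalized) mean: (sum_i w_i * x_i^p)^(1/p).

    p = 1 recovers the weighted arithmetic mean; as p grows the estimate
    moves toward the maximum. Assumes non-negative values and p > 0.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float((w @ x**p) ** (1.0 / p))

# Example: back up a node value from child action-value estimates,
# weighting each child by its (hypothetical) visit count.
print(power_mean([0.40, 0.55, 0.35], weights=[10, 25, 5], p=4.0))
```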
- Poisson Process for Bayesian Optimization [126.51200593377739]
We propose a ranking-based surrogate model based on the Poisson process and introduce an efficient BO framework, namely Poisson Process Bayesian Optimization (PoPBO).
Compared to the classic GP-BO method, our PoPBO has lower costs and better robustness to noise, as verified by extensive experiments.
arXiv Detail & Related papers (2024-02-05T02:54:50Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution using Langevin Monte Carlo; see the sketch after this entry.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
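The entry above samples the Q function from its posterior with Langevin Monte Carlo. Below is a minimal sketch of the unadjusted Langevin update that such methods build on; the step size, parameter vector, and toy standard-Gaussian target are hypothetical stand-ins, not the paper's actual Q posterior.

```python
import numpy as np

def langevin_step(theta, grad_log_post, step_size, rng):
    """One unadjusted Langevin Monte Carlo step:
    theta' = theta + (eta / 2) * grad log p(theta | data) + sqrt(eta) * N(0, I).
    """
    noise = rng.standard_normal(theta.shape)
    return theta + 0.5 * step_size * grad_log_post(theta) + np.sqrt(step_size) * noise

# Toy target: standard Gaussian, so grad log p(theta) = -theta.
rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(1000):
    theta = langevin_step(theta, lambda t: -t, step_size=1e-2, rng=rng)
print(theta)  # one approximate sample from the target after 1000 steps
```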
- Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage [15.858892479232656]
We study robust offline reinforcement learning (robust offline RL).
We propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization ($\texttt{P}^2\texttt{MPO}$).
We show that $\texttt{P}^2\texttt{MPO}$ enjoys a $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate, where $n$ is the dataset size.
arXiv Detail & Related papers (2023-05-16T17:58:05Z)
- Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits.
We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
arXiv Detail & Related papers (2022-06-22T17:58:23Z)
- Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem in the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)
- Learning Representations using Spectral-Biased Random Walks on Graphs [18.369974607582584]
We study how much a probabilistic bias in the random walk process affects the quality of the nodes it picks.
We succinctly capture a node's neighborhood as a probability measure based on the spectrum of its neighborhood subgraph, represented as a normalized Laplacian matrix (a Laplacian-spectrum sketch follows this entry).
We empirically evaluate our approach against several state-of-the-art node embedding techniques on a wide variety of real-world datasets.
arXiv Detail & Related papers (2020-05-19T20:42:43Z)
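The entry above summarizes a node's neighborhood through the spectrum of its normalized Laplacian. For orientation, the sketch below computes that spectrum with NumPy on a hypothetical toy graph; it is not the paper's embedding method.

```python
import numpy as np

def normalized_laplacian_spectrum(adjacency):
    """Eigenvalues of the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    A = np.asarray(adjacency, dtype=float)
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = 1.0 / np.sqrt(d[nz])
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.linalg.eigvalsh(L)

# Toy 4-node path graph standing in for a node's neighborhood subgraph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(normalized_laplacian_spectrum(A))  # eigenvalues lie in [0, 2]
```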
- Bayesian optimization for backpropagation in Monte-Carlo tree search [1.52292571922932]
We present two methods, Softmax MCTS and Monotone MCTS, which generalize previous attempts to improve upon the backpropagation strategy.
We conduct experiments on computer Go, where the returns are given by a deep value neural network, and show that our proposed framework outperforms previous methods.
arXiv Detail & Related papers (2020-01-25T14:33:38Z)
- Bayesian Quantile and Expectile Optimisation [3.3878745408530833]
We propose new variational models for Bayesian quantile and expectile regression that are well-suited for heteroscedastic noise settings.
Our strategies can directly optimise for the quantile and expectile, without requiring replicated observations or assuming a parametric form for the noise (the standard pinball and expectile losses are sketched after this list).
As illustrated in the experimental section, the proposed approach clearly outperforms the state of the art in the heteroscedastic, non-Gaussian case.
arXiv Detail & Related papers (2020-01-12T20:51:21Z)
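For reference, the sketch below shows the classical pinball (quantile) and asymmetric-squared (expectile) losses that define the targets optimised in the last entry above; it is not the paper's variational model, and the exponential toy data are hypothetical.

```python
import numpy as np

def pinball_loss(u, tau):
    """Quantile (pinball) loss: tau * u if u >= 0 else (tau - 1) * u."""
    u = np.asarray(u, dtype=float)
    return np.maximum(tau * u, (tau - 1.0) * u)

def expectile_loss(u, tau):
    """Asymmetric squared loss whose minimizer is the tau-expectile."""
    u = np.asarray(u, dtype=float)
    return np.abs(tau - (u < 0)) * u**2

# The tau-quantile of a sample minimizes the mean pinball loss over candidates q.
rng = np.random.default_rng(0)
y = rng.exponential(size=10_000)          # skewed, non-Gaussian noise
qs = np.linspace(0.0, 5.0, 501)
best_q = qs[np.argmin([pinball_loss(y - q, 0.9).mean() for q in qs])]
print(best_q, np.quantile(y, 0.9))        # the two should roughly agree
```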
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.