Power Mean Estimation in Stochastic Monte-Carlo Tree Search
- URL: http://arxiv.org/abs/2406.02235v1
- Date: Tue, 4 Jun 2024 11:56:37 GMT
- Title: Power Mean Estimation in Stochastic Monte-Carlo Tree Search
- Authors: Tuan Dam, Odalric-Ambrym Maillard, Emilie Kaufmann
- Abstract summary: Monte-Carlo Tree Search (MCTS) is a widely-used strategy for online planning that combines Monte-Carlo sampling with forward tree search.
The theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection.
This paper introduces Stochastic-Power-UCT, an MCTS algorithm using the power mean estimator and tailored for stochastic MDPs.
- Score: 25.058008522872747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monte-Carlo Tree Search (MCTS) is a widely-used strategy for online planning that combines Monte-Carlo sampling with forward tree search. Its success relies on the Upper Confidence bound for Trees (UCT) algorithm, an extension of the UCB method for multi-armed bandits. However, the theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection, leading to the development of Fixed-Depth-MCTS with a polynomial exploration bonus to balance exploration and exploitation~\citep{shah2022journal}. Both UCT and Fixed-Depth-MCTS suffer from biased value estimation: the weighted sum underestimates the optimal value, while the maximum overestimates it~\citep{coulom2006efficient}. The power mean estimator offers a balanced solution, lying between the average and maximum values. Power-UCT~\citep{dam2019generalized} incorporates this estimator for more accurate value estimates, but its theoretical analysis remains incomplete. This paper introduces Stochastic-Power-UCT, an MCTS algorithm using the power mean estimator and tailored for stochastic MDPs. We analyze its polynomial convergence in estimating root node values and show that it shares the same $\mathcal{O}(n^{-1/2})$ convergence rate as Fixed-Depth-MCTS, where $n$ is the number of visited trajectories, with the latter being a special case of the former. Our theoretical results are validated with empirical tests across various stochastic MDP environments.
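For intuition, the power mean backup at a tree node can be sketched as follows. This is a minimal illustration with hypothetical child values and visit counts, not the full Stochastic-Power-UCT algorithm: setting $p=1$ recovers the weighted average, and $p \to \infty$ approaches the maximum.

```python
import numpy as np

def power_mean(values, weights, p):
    """Weighted power mean: (sum_i w_i * x_i^p)^(1/p).

    Interpolates between the weighted average (p = 1) and the
    maximum (p -> infinity) of `values`.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize visit counts to weights
    return (weights @ values**p) ** (1.0 / p)

# Hypothetical value backup at an internal node: children's value
# estimates are aggregated with visit-count weights n_i / n.
child_values = [0.4, 0.7, 0.5]   # hypothetical child value estimates
child_visits = [10, 25, 5]       # hypothetical visit counts

print(power_mean(child_values, child_visits, p=1.0))   # weighted average
print(power_mean(child_values, child_visits, p=4.0))   # between mean and max
print(power_mean(child_values, child_visits, p=64.0))  # close to max = 0.7
```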
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z)
- Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation [53.17668583030862]
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation.
We propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP).
We show that LOOP achieves a sublinear $\tilde{\mathcal{O}}(\mathrm{poly}(d, \mathrm{sp}(V^*)) \sqrt{T\beta})$ regret, where $d$ and $\beta$ correspond to the AGEC and the log-covering number of the hypothesis class, respectively.
arXiv Detail & Related papers (2024-04-19T06:24:22Z)
- Leveraging Nested MLMC for Sequential Neural Posterior Estimation with Intractable Likelihoods [0.8287206589886881]
Sequential neural posterior estimation (SNPE) techniques are proposed for dealing with simulation-based models with intractable likelihoods.
In this paper, we propose a nested APT method to estimate the involved nested expectation.
Since the nested estimators for the loss function and its gradient are biased, we make use of unbiased multi-level Monte Carlo (MLMC) estimators.
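The unbiased MLMC idea can be sketched generically. Below is a minimal single-term randomized MLMC estimator in the style of Rhee and Glynn, not the paper's nested APT estimator; the level-`l` approximation `estimator_at_level` and the geometric level distribution are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimator_at_level(l, sample):
    """Hypothetical level-l approximation Y_l of the target expectation.
    Here: a toy sequence converging to E[X^2] with bias decaying as 2^-l."""
    return sample**2 + 2.0**(-l)   # stand-in for a biased nested estimator

def unbiased_mlmc(num_draws=100_000, q=0.6):
    """Single-term randomized MLMC: draw a random level L with
    P(L = l) = q (1-q)^l and return (Y_L - Y_{L-1}) / P(L), with Y_{-1} = 0.
    The telescoping sum makes this unbiased for lim_l E[Y_l]."""
    total = 0.0
    for _ in range(num_draws):
        l = rng.geometric(q) - 1            # level L with support {0, 1, ...}
        p_l = q * (1 - q)**l                # P(L = l)
        x = rng.standard_normal()           # base randomness shared across levels
        y_l = estimator_at_level(l, x)
        y_lm1 = estimator_at_level(l - 1, x) if l > 0 else 0.0
        total += (y_l - y_lm1) / p_l
    return total / num_draws

print(unbiased_mlmc())   # approx E[X^2] = 1 for X ~ N(0, 1)
```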
arXiv Detail & Related papers (2024-01-30T06:29:41Z)
- Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients [0.8749675983608172]
We present an unbiased method for posterior means based on kinetic Langevin dynamics.
Our proposed estimator is unbiased, attains finite variance, and satisfies a central limit theorem.
Our results demonstrate that in large-scale applications, the unbiased algorithm we present can be 2-3 orders of magnitude more efficient than the "gold-standard" randomized Hamiltonian Monte Carlo.
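The summary does not spell out the unbiasing mechanism, so the sketch below shows only the biased baseline: one Euler-Maruyama step of kinetic (underdamped) Langevin dynamics with an inexact gradient. The target `grad_U`, the gradient-noise model, and the step size are illustrative assumptions, and the paper's debiasing layer is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_U(x):
    """Gradient of the negative log-posterior U; standard Gaussian target here."""
    return x

def noisy_grad_U(x, noise_scale=0.1):
    """Inexact gradient: exact gradient plus zero-mean noise (e.g., subsampling)."""
    return grad_U(x) + noise_scale * rng.standard_normal(x.shape)

def kinetic_langevin_step(x, v, h=0.05, gamma=1.0):
    """One Euler-Maruyama step of kinetic (underdamped) Langevin dynamics:
        dX = V dt,   dV = -gamma V dt - grad U(X) dt + sqrt(2 gamma) dW."""
    v = v - h * gamma * v - h * noisy_grad_U(x) \
        + np.sqrt(2.0 * gamma * h) * rng.standard_normal(x.shape)
    x = x + h * v
    return x, v

x, v = np.zeros(1), np.zeros(1)
samples = []
for t in range(20_000):
    x, v = kinetic_langevin_step(x, v)
    samples.append(x[0])
print(np.mean(samples[5_000:]))   # posterior mean of N(0, 1): approx 0
```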
arXiv Detail & Related papers (2023-11-08T21:19:52Z)
- Monte-Carlo tree search with uncertainty propagation via optimal transport [27.931569422467835]
This paper introduces a novel backup strategy for Monte-Carlo Tree Search (MCTS).
We adopt a probabilistic approach, modeling both value and action-value nodes as Gaussian distributions.
We provide theoretical guarantees of convergence to the optimal policy, and an empirical evaluation on several stochastic and partially observable environments.
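The exact backup operator is not given in this summary. One natural ingredient for propagating Gaussian beliefs under the 2-Wasserstein metric is the 1-D Gaussian barycenter, sketched here with hypothetical visit-count weights.

```python
import numpy as np

def wasserstein_barycenter_1d_gauss(means, stds, weights):
    """L2-Wasserstein barycenter of 1-D Gaussians N(mu_i, sigma_i^2):
    itself Gaussian, with mean sum_i w_i mu_i and std sum_i w_i sigma_i."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mu = weights @ np.asarray(means, dtype=float)
    sigma = weights @ np.asarray(stds, dtype=float)
    return mu, sigma

# Hypothetical backup: a value node summarizes its children's
# action-value posteriors (means/stds) weighted by visit counts.
child_means = [0.3, 0.8, 0.5]
child_stds = [0.20, 0.10, 0.30]
child_visits = [12, 30, 8]

mu, sigma = wasserstein_barycenter_1d_gauss(child_means, child_stds, child_visits)
print(mu, sigma)   # propagated Gaussian belief for the parent node
```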
arXiv Detail & Related papers (2023-09-19T16:32:04Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
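As a rough illustration of least-squares reward redistribution, here is a toy linear instance: per-step features are regressed against trajectory-level returns, then evaluated per step. This is not PARTED itself; its pessimism terms are omitted and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: each trajectory gives per-step features phi(s_t, a_t)
# and a single trajectory-level return R; we fit theta so that the sum of
# per-step proxy rewards phi_t^T theta matches R in the least-squares sense.
d, horizon, n_traj = 4, 10, 200
theta_true = np.array([1.0, -0.5, 0.3, 0.0])

features = rng.standard_normal((n_traj, horizon, d))       # phi(s_t, a_t)
returns = features.sum(axis=1) @ theta_true \
          + 0.1 * rng.standard_normal(n_traj)              # noisy trajectory return

# Least-squares redistribution: regress returns on the summed features.
X = features.sum(axis=1)                   # (n_traj, d): sum_t phi_t per trajectory
theta_hat, *_ = np.linalg.lstsq(X, returns, rcond=None)

proxy_rewards = features @ theta_hat       # (n_traj, horizon): per-step proxy reward
print(theta_hat)                           # approx theta_true
print(proxy_rewards[0])                    # redistributed per-step rewards, traj 0
```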
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Exponential Family Model-Based Reinforcement Learning via Score Matching [97.31477125728844]
We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL).
SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression.
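For an exponential family $p(x;\theta) \propto \exp(\theta^\top \psi(x))$, the Hyvärinen score matching objective is quadratic in $\theta$, so the ridge-regularized minimizer has a closed form. Below is a toy 1-D sketch of only this estimation step; SMRL's optimism machinery is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy exponential-family model p(x; theta) prop. to exp(theta^T psi(x))
# with sufficient statistics psi(x) = (x, x^2); the Gaussian N(mu, sigma^2)
# has natural parameters (mu/sigma^2, -1/(2 sigma^2)).
def psi_grad(x):   # d psi_j / dx, shape (n, 2)
    return np.stack([np.ones_like(x), 2.0 * x], axis=1)

def psi_lap(x):    # d^2 psi_j / dx^2, shape (n, 2)
    return np.stack([np.zeros_like(x), 2.0 * np.ones_like(x)], axis=1)

# Score matching for exponential families is quadratic in theta:
#   J(theta) = 0.5 theta^T C theta + b^T theta + const,
#   C = E[grad psi grad psi^T],  b = E[lap psi],
# so the ridge-regularized minimizer is theta = -(C + lam I)^{-1} b.
x = 1.5 + 0.8 * rng.standard_normal(10_000)     # data from N(1.5, 0.8^2)
G, L = psi_grad(x), psi_lap(x)
C = G.T @ G / len(x)
b = L.mean(axis=0)
lam = 1e-6
theta_hat = -np.linalg.solve(C + lam * np.eye(2), b)

mu, sigma2 = 1.5, 0.8**2
print(theta_hat)                                 # estimated natural parameters
print([mu / sigma2, -1.0 / (2.0 * sigma2)])      # true: approx [2.34, -0.78]
```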
arXiv Detail & Related papers (2021-12-28T15:51:07Z)
- Mean-Square Analysis with An Application to Optimal Dimension Dependence of Langevin Monte Carlo [60.785586069299356]
This work provides a general framework for the non-asymptotic analysis of sampling error in the 2-Wasserstein distance.
Our theoretical analysis is further validated by numerical experiments.
arXiv Detail & Related papers (2021-09-08T18:00:05Z)
- Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards [24.983866845065926]
We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_p$ for $1 < p \leq 2$.
We propose a novel robust estimator which does not require $\nu_p$ as prior information.
We show that the error probability of the proposed estimator decays exponentially fast.
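The paper's own estimator is not specified in this summary. For comparison, a standard median-of-means estimator likewise needs no prior knowledge of $\nu_p$; a minimal sketch on heavy-tailed rewards follows.

```python
import numpy as np

rng = np.random.default_rng(4)

def median_of_means(samples, n_blocks=8):
    """Standard median-of-means robust mean estimator: split the samples
    into blocks, average each block, and take the median of block means.
    Needs no prior bound on the moment constant nu_p."""
    samples = np.asarray(samples, dtype=float)
    blocks = np.array_split(rng.permutation(samples), n_blocks)
    return np.median([b.mean() for b in blocks])

# Heavy-tailed rewards: a Lomax (Pareto II) distribution with shape 1.8 has
# a finite p-th moment only for p < 1.8, so the empirical mean is unstable.
a = 1.8
rewards = rng.pareto(a, size=2_000)           # mean = 1/(a-1) = 1.25
print(rewards.mean())                          # empirical mean: high variance
print(median_of_means(rewards))                # more stable estimate
```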
arXiv Detail & Related papers (2020-10-24T10:44:02Z)