Related papers: Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

URL: http://arxiv.org/abs/2512.21648v1
Date: Thu, 25 Dec 2025 12:25:26 GMT
Title: Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search
Authors: Maximilian Weichart,
Abstract summary: Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL)<n>We introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB.<n>Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB. Applying this method to the variance-aware UCB-V, we obtain two new prior-based tree policies that incorporate variance estimates into the search. Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost. We also provide an extension of the mctx library supporting variance-aware UCTs, showing that the required code changes are minimal and intended to facilitate further research on principled prior-based UCTs. Code: github.com/Max-We/inverse-rpo.

Related papers

Accelerating Monte-Carlo Tree Search with Optimized Posterior Policies [0.0]
RMCTS is more than 40 times faster than MCTS-UCB when searching a single root state.<n>We find that RMCTS-trained networks match the quality of MCTS-UCB-trained networks in roughly one-third of the training time.
arXiv Detail & Related papers (2026-01-03T23:38:43Z)
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling [65.46347858249295]
TreePO is a self-guided rollout algorithm that views sequence generation as a tree-structured searching process.<n>TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity.
arXiv Detail & Related papers (2025-08-24T16:52:37Z)
Stochastic Primal-Dual Double Block-Coordinate for Two-way Partial AUC Maximization [45.99743804547533]
Two-way partial AUCAUC is a critical performance metric for binary classification with imbalanced data.<n>Existing algorithms for TPAUC optimization remain under-explored.<n>We introduce two innovative double-coordinate block-coordinate algorithms for TPAUC optimization.
arXiv Detail & Related papers (2025-05-28T03:55:05Z)
Power Mean Estimation in Stochastic Monte-Carlo Tree_Search [25.058008522872747]
Monte-Carlo Tree Search (MCTS) is a widely-used strategy for online planning that combines Monte-Carlo sampling with forward tree search. The theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection. This paper introduces an algorithm using the power mean estimator and tailored for MDPs.
arXiv Detail & Related papers (2024-06-04T11:56:37Z)
Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs) This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models. In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
Scale-Adaptive Balancing of Exploration and Exploitation in Classical Planning [1.6574413179773757]
We show that a more detailed theoretical understanding of MAB literature helps improve existing planning algorithms. We propose GreedyUCT-Normal, a MCTS/THTS algorithm with UCB1-Normal bandit for agile classical planning.
arXiv Detail & Related papers (2023-05-16T22:46:37Z)
Reinforcement Learning for Branch-and-Bound Optimisation using Retrospective Trajectories [72.15369769265398]
Machine learning has emerged as a promising paradigm for branching. We propose retro branching; a simple yet effective approach to RL for branching. We outperform the current state-of-the-art RL branching algorithm by 3-5x and come within 20% of the best IL method's performance on MILPs with 500 constraints and 1000 variables.
arXiv Detail & Related papers (2022-05-28T06:08:07Z)
Robust Predictable Control [149.71263296079388]
We show that our method achieves much tighter compression than prior methods, achieving up to 5x higher reward than a standard information bottleneck. We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.
arXiv Detail & Related papers (2021-09-07T17:29:34Z)
Principled Exploration via Optimistic Bootstrapping and Backward Induction [84.78836146128238]
We propose a principled exploration method for Deep Reinforcement Learning (DRL) through Optimistic Bootstrapping and Backward Induction (OB2I) OB2I constructs a general-purpose UCB-bonus through non-parametric bootstrap in DRL. We build theoretical connections between the proposed UCB-bonus and the LSVI-UCB in a linear setting.
arXiv Detail & Related papers (2021-05-13T01:15:44Z)
SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms. SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.