A Unified Perspective on Value Backup and Exploration in Monte-Carlo
Tree Search
- URL: http://arxiv.org/abs/2202.07071v1
- Date: Fri, 11 Feb 2022 15:30:08 GMT
- Title: A Unified Perspective on Value Backup and Exploration in Monte-Carlo
Tree Search
- Authors: Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen
- Abstract summary: We propose two methods for improving the convergence rate and exploration based on a newly introduced backup operator and entropy regularization.
We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework.
In practice, our unified perspective offers a flexible way to balance between exploration and exploitation by tuning the single $\alpha$ parameter according to the problem at hand.
- Score: 41.11958980731047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monte-Carlo Tree Search (MCTS) is a class of methods for solving complex
decision-making problems through the synergy of Monte-Carlo planning and
Reinforcement Learning (RL). The highly combinatorial nature of the problems
commonly addressed by MCTS requires the use of efficient exploration strategies
for navigating the planning tree and quickly convergent value backup methods.
These crucial problems are particularly evident in recent advances that combine
MCTS with deep neural networks for function approximation. In this work, we
propose two methods for improving the convergence rate and exploration based on
a newly introduced backup operator and entropy regularization. We provide
strong theoretical guarantees to bound convergence rate, approximation error,
and regret of our methods. Moreover, we introduce a mathematical framework
based on the use of the $\alpha$-divergence for backup and exploration in MCTS.
We show that this theoretical formulation unifies different approaches,
including our newly introduced ones, under the same mathematical framework,
allowing different methods to be obtained by simply changing the value of $\alpha$.
In practice, our unified perspective offers a flexible way to balance between
exploration and exploitation by tuning the single $\alpha$ parameter according
to the problem at hand. We validate our methods through a rigorous empirical
study ranging from basic toy problems to complex Atari games, covering both
MDP and POMDP problems.
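For orientation, the following is the standard definition of the $\alpha$-divergence between two discrete distributions $p$ and $q$ (given for reference; the paper's operator definitions may use a different but equivalent parametrization):

$$ D_\alpha(p \,\|\, q) \;=\; \frac{1}{\alpha(\alpha - 1)} \left( \sum_i p_i^{\alpha}\, q_i^{1-\alpha} \;-\; 1 \right) $$

In the limit $\alpha \to 1$ this recovers the KL divergence $\mathrm{KL}(p \,\|\, q)$, while $\alpha \to 0$ recovers the reverse KL divergence $\mathrm{KL}(q \,\|\, p)$, which illustrates how a single parameter $\alpha$ can interpolate between different regularizers for value backup and exploration.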
Related papers
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability.
arXiv Detail & Related papers (2024-10-03T18:12:29Z) - Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning [15.46907000938726]
We present the first study on provably efficient randomized exploration in cooperative multi-agent reinforcement learning (MARL).
We propose a unified algorithm framework for randomized exploration in parallel Markov Decision Processes (MDPs), and two Thompson Sampling (TS)-type algorithms, CoopTS-PHE and CoopTS-LMC.
We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (i.e., $N$-chain), a video game, and a real-world problem in energy systems.
arXiv Detail & Related papers (2024-04-16T17:01:38Z) - An Integrated Framework Integrating Monte Carlo Tree Search and
Supervised Learning for Train Timetabling Problem [0.0]
The single-track railway train timetabling problem (TTP) is an important and complex problem.
This article proposes an integrated Monte Carlo Tree Search (MCTS) computing framework that combines unsupervised learning and supervised learning methods for solving the TTP in discrete action spaces.
arXiv Detail & Related papers (2023-11-02T03:39:14Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We directly sample the Q function from its posterior distribution using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
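As a rough illustration of the underlying mechanism, the sketch below shows a generic Langevin Monte Carlo (SGLD-style) update on Q-network parameters; the names, helper function, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch

def langevin_q_update(q_net, td_loss_fn, batch, step_size=1e-3, temperature=1e-4):
    """One Langevin Monte Carlo step: gradient descent on a TD loss plus
    Gaussian noise, so the parameters are (approximately) a sample from a
    posterior over Q functions rather than a single point estimate."""
    loss = td_loss_fn(q_net, batch)   # assumed helper computing a Bellman/TD loss
    q_net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in q_net.parameters():
            if p.grad is None:
                continue
            noise = torch.randn_like(p) * (2.0 * step_size * temperature) ** 0.5
            p.add_(-step_size * p.grad + noise)
    return loss.item()
```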
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Feature Acquisition using Monte Carlo Tree Search [18.76745359031975]
Feature acquisition algorithms address the problem of acquiring informative features while balancing the costs of acquisition to improve the learning performance of ML models.
Previous approaches have focused on calculating the expected utility values of features to determine the acquisition sequences.
In comparison to previous approaches, we focus on 1) formulating the feature acquisition problem as an MDP and applying Monte Carlo Tree Search, 2) calculating the intermediary rewards for each acquisition step based on model improvements and acquisition costs, and 3) simultaneously optimizing model improvement and acquisition costs with multi-objective Monte Carlo Tree Search.
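Read as an MDP, this can be sketched roughly as follows: the state is the set of features acquired so far, an action either acquires another feature or stops, and the intermediary reward trades off the resulting model improvement against the acquisition cost. The names below are hypothetical and `model_score` stands in for any validation metric of a model trained on the given feature subset; this only illustrates the formulation, not the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AcquisitionState:
    acquired: frozenset = frozenset()   # indices of features acquired so far

def step(state, action, feature_costs, model_score):
    """One transition of the feature-acquisition MDP (illustrative sketch).
    `action` is a feature index to acquire, or the string "stop"."""
    if action == "stop":
        return state, 0.0, True                        # terminal, no extra reward
    new_state = AcquisitionState(state.acquired | {action})
    improvement = model_score(new_state.acquired) - model_score(state.acquired)
    reward = improvement - feature_costs[action]       # intermediary reward
    return new_state, reward, False
```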
arXiv Detail & Related papers (2022-12-21T20:53:44Z) - Model-Free $\mu$ Synthesis via Adversarial Reinforcement Learning [2.2725929250900947]
We develop a model-free version of the well-known $DK$-iteration for solving state-feedback $\mu$ synthesis with static $D$-scaling.
In the proposed algorithm, the $K$ step mimics the classical central path algorithm via a recently-developed double-loop adversarial RL method as a subroutine.
Our study sheds new light on the connections between adversarial RL and robust control.
arXiv Detail & Related papers (2021-11-30T16:29:49Z) - Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for mean-field control (MFC).
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z) - SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep
Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
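A compact sketch of these two ingredients is given below; the weighting function, shapes, and hyperparameters are simplified assumptions for illustration, not the exact SUNRISE formulas.

```python
import numpy as np

def weighted_bellman_target(rewards, next_q_ensemble, gamma=0.99, temperature=10.0):
    """Ensemble-based weighted Bellman backup: targets with high ensemble
    disagreement (std across the Q-ensemble) get smaller loss weights.
    next_q_ensemble has shape [n_ensemble, batch_size]."""
    mean_q = next_q_ensemble.mean(axis=0)
    std_q = next_q_ensemble.std(axis=0)
    weights = 1.0 / (1.0 + np.exp(temperature * std_q))  # one monotone down-weighting choice
    targets = rewards + gamma * mean_q
    return targets, weights        # weights multiply each sample's TD error in the loss

def ucb_action(q_values_ensemble, lam=1.0):
    """Exploration via upper-confidence bounds: pick the action whose
    ensemble mean plus lam * std is largest.
    q_values_ensemble has shape [n_ensemble, n_actions]."""
    mean_q = q_values_ensemble.mean(axis=0)
    std_q = q_values_ensemble.std(axis=0)
    return int(np.argmax(mean_q + lam * std_q))
```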
arXiv Detail & Related papers (2020-07-09T17:08:44Z) - Convex Regularization in Monte-Carlo Tree Search [41.11958980731047]
We introduce a unifying theory on the use of generic convex regularizers in Monte-Carlo Tree Search (MCTS).
We exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update.
We empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor.
arXiv Detail & Related papers (2020-07-01T11:29:08Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.