Convex Regularization in Monte-Carlo Tree Search
- URL: http://arxiv.org/abs/2007.00391v3
- Date: Tue, 16 Feb 2021 15:14:47 GMT
- Title: Convex Regularization in Monte-Carlo Tree Search
- Authors: Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen
- Abstract summary: We introduce a unifying theory on the use of generic convex regularizers in Monte-Carlo Tree Search (MCTS).
We exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update.
We empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor.
- Score: 41.11958980731047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monte-Carlo planning and Reinforcement Learning (RL) are essential to
sequential decision making. The recent AlphaGo and AlphaZero algorithms have
shown how to successfully combine these two paradigms in order to solve
large-scale sequential decision problems. These methodologies exploit a variant of
the well-known UCT algorithm to trade off exploitation of good actions and
exploration of unvisited states, but their empirical success comes at the cost
of poor sample-efficiency and high computation time. In this paper, we overcome
these limitations by considering convex regularization in Monte-Carlo Tree
Search (MCTS), which has been successfully used in RL to efficiently drive
exploration. First, we introduce a unifying theory on the use of generic convex
regularizers in MCTS, deriving the regret analysis and providing guarantees of
exponential convergence rate. Second, we exploit our theoretical framework to
introduce novel regularized backup operators for MCTS, based on the relative
entropy of the policy update, and on the Tsallis entropy of the policy.
Finally, we empirically evaluate the proposed operators in AlphaGo and
AlphaZero on problems of increasing dimensionality and branching factor, from a
toy problem to several Atari games, showing their superiority w.r.t.
representative baselines.
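To make the idea of a regularized backup concrete, here is a minimal sketch assuming a Shannon-entropy regularizer, for which the regularized maximum of the action values is the temperature-scaled log-sum-exp and the maximizing policy is a Boltzmann distribution. The function names, the temperature, and the toy values are illustrative assumptions; they are not the paper's exact operators, which also cover relative-entropy and Tsallis-entropy regularization.

```python
# Minimal sketch of an entropy-regularized backup at a single MCTS node
# (illustrative only; not the paper's exact backup or selection operators).
import numpy as np

def softmax_backup(q_values: np.ndarray, tau: float = 0.1) -> float:
    """Shannon-entropy-regularized value: tau * log(sum_a exp(Q(s, a) / tau))."""
    q_max = np.max(q_values)                 # subtract the max for numerical stability
    return q_max + tau * np.log(np.sum(np.exp((q_values - q_max) / tau)))

def boltzmann_policy(q_values: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Policy attaining the regularized value above (softmax over Q / tau)."""
    z = np.exp((q_values - np.max(q_values)) / tau)
    return z / z.sum()

if __name__ == "__main__":
    q = np.array([0.1, 0.5, 0.4])            # toy action-value estimates at one node
    print(softmax_backup(q), boltzmann_policy(q))
```

As the temperature tau goes to zero the backup recovers the usual hard maximum of UCT-style value propagation, while larger temperatures spread probability (and hence visits) over more actions, which is how the regularizer drives exploration.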
Related papers
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability.
arXiv Detail & Related papers (2024-10-03T18:12:29Z)
- Learning Merton's Strategies in an Incomplete Market: Recursive Entropy Regularization and Biased Gaussian Exploration [11.774563966512709]
We take the reinforcement learning (RL) approach to learn optimal portfolio policies directly by exploring the unknown market.
We present an analysis of the resulting errors to show how the level of exploration affects the learned policies.
arXiv Detail & Related papers (2023-12-19T02:14:13Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution using Langevin Monte Carlo (a toy sketch of this update appears after this list).
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Benchmarking Constraint Inference in Inverse Reinforcement Learning [19.314352936252444]
In many real-world problems, the constraints followed by expert agents are hard to specify mathematically and are unknown to the RL agents.
In this paper, we construct a CIRL benchmark in the context of two major application domains: robot control and autonomous driving.
The benchmark, including the information for reproducing the performance of CIRL algorithms, is publicly available at https://github.com/Guiliang/CIRL-benchmarks-public.
arXiv Detail & Related papers (2022-06-20T09:22:20Z)
- Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes [3.9311044240639568]
Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for expected return using gradient ascent.
While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues.
In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL. We then explore a combined policy approach of PG and MCTL to leverage their strengths.
arXiv Detail & Related papers (2022-06-02T12:21:40Z)
- A Unified Perspective on Value Backup and Exploration in Monte-Carlo Tree Search [41.11958980731047]
We propose two methods for improving the convergence rate and exploration based on a newly introduced backup operator and entropy regularization.
We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework.
In practice, our unified perspective offers a flexible way to balance between exploration and exploitation by tuning the single $\alpha$ parameter according to the problem at hand.
arXiv Detail & Related papers (2022-02-11T15:30:08Z)
- Collective eXplainable AI: Explaining Cooperative Strategies and Agent Contribution in Multiagent Reinforcement Learning with Shapley Values [68.8204255655161]
This study proposes a novel approach to explain cooperative strategies in multiagent RL using Shapley values.
Results could have implications for non-discriminatory decision making, ethical and responsible AI-derived decisions or policy making under fairness constraints.
arXiv Detail & Related papers (2021-10-04T10:28:57Z)
- Bayesian decision-making under misspecified priors with applications to meta-learning [64.38020203019013]
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
arXiv Detail & Related papers (2021-07-03T23:17:26Z)
- Bayesian Bellman Operators [55.959376449737405]
We introduce a novel perspective on Bayesian reinforcement learning (RL).
Our framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions.
arXiv Detail & Related papers (2021-06-09T12:20:46Z)
- Monte-Carlo Tree Search as Regularized Policy Optimization [47.541849128047865]
We show that AlphaZero's search algorithm is an approximation to the solution of a specific regularized policy optimization problem (see the sketch after this list).
We propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.
arXiv Detail & Related papers (2020-07-24T13:01:34Z)
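The last entry above ("Monte-Carlo Tree Search as Regularized Policy Optimization") describes AlphaZero's search as approximating the solution of a regularized policy optimization problem. The following is a minimal sketch under an assumed formulation: maximize the expected Q-value minus a reverse-KL penalty toward a prior (network) policy over the probability simplex. The variable names, the penalty weight lam, and the bisection tolerance are illustrative, not taken from the paper.

```python
# Sketch: solve  max_{pi in simplex}  <q, pi> - lam * KL(prior || pi).
# The stationarity condition gives  pi(a) = lam * prior(a) / (mu - q(a)),
# with the scalar multiplier mu fixed by bisection so that pi sums to one.
import numpy as np

def regularized_policy(q: np.ndarray, prior: np.ndarray, lam: float,
                       iters: int = 100) -> np.ndarray:
    lo, hi = np.max(q) + 1e-9, np.max(q) + lam   # sum(pi) >= 1 at lo and <= 1 at hi
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if (lam * prior / (mu - q)).sum() > 1.0:
            lo = mu                               # sum too large: increase mu
        else:
            hi = mu                               # sum too small: decrease mu
    pi = lam * prior / (0.5 * (lo + hi) - q)
    return pi / pi.sum()                          # clean up residual bisection error

if __name__ == "__main__":
    q = np.array([0.2, 0.8, 0.5])                 # toy Q-value estimates
    prior = np.array([0.3, 0.4, 0.3])             # toy prior policy from a network
    print(regularized_policy(q, prior, lam=0.5))
```

As lam grows the solution approaches the prior policy, and as lam shrinks it concentrates on the highest-valued action.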
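For the Langevin Monte Carlo entry above ("Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo"), the toy sketch below illustrates the general mechanism under assumed details: noisy gradient steps on a squared regression loss for a linear Q-function yield an approximate posterior sample of the weights, which can then drive Thompson-sampling-style action selection. The loss, features, step size, and step count are all illustrative assumptions, not the paper's algorithm.

```python
# Toy Langevin dynamics: theta <- theta - step * grad(U) + sqrt(2 * step) * noise,
# where U is a squared data-fit loss plus a Gaussian prior term for linear Q-weights.
import numpy as np

rng = np.random.default_rng(0)

def langevin_sample(theta, features, targets, step=1e-2, n_steps=200):
    """Return one approximate posterior sample of linear Q-function weights."""
    for _ in range(n_steps):
        residual = features @ theta - targets
        grad = features.T @ residual + theta      # data-fit gradient + prior gradient
        theta = theta - step * grad + np.sqrt(2.0 * step) * rng.normal(size=theta.shape)
    return theta

if __name__ == "__main__":
    phi = rng.normal(size=(32, 4))                # toy state-action features
    w_true = np.array([1.0, -0.5, 0.2, 0.0])
    y = phi @ w_true + 0.1 * rng.normal(size=32)  # toy regression targets (e.g. TD targets)
    print(langevin_sample(np.zeros(4), phi, y))   # acting greedily w.r.t. this sampled
                                                  # Q gives Thompson-sampling-like exploration
```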