Convex Regularization in Monte-Carlo Tree Search
- URL: http://arxiv.org/abs/2007.00391v3
- Date: Tue, 16 Feb 2021 15:14:47 GMT
- Title: Convex Regularization in Monte-Carlo Tree Search
- Authors: Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen
- Abstract summary: We introduce a unifying theory on the use of generic convex regularizers in Monte-Carlo Tree Search (MCTS).
We exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update.
We empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor.
- Score: 41.11958980731047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monte-Carlo planning and Reinforcement Learning (RL) are essential to
sequential decision making. The recent AlphaGo and AlphaZero algorithms have
shown how to successfully combine these two paradigms in order to solve
large-scale sequential decision problems. These methodologies exploit a variant of
the well-known UCT algorithm to trade off exploitation of good actions and
exploration of unvisited states, but their empirical success comes at the cost
of poor sample-efficiency and high computation time. In this paper, we overcome
these limitations by considering convex regularization in Monte-Carlo Tree
Search (MCTS), which has been successfully used in RL to efficiently drive
exploration. First, we introduce a unifying theory on the use of generic convex
regularizers in MCTS, deriving the regret analysis and providing guarantees of
exponential convergence rate. Second, we exploit our theoretical framework to
introduce novel regularized backup operators for MCTS, based on the relative
entropy of the policy update, and on the Tsallis entropy of the policy.
Finally, we empirically evaluate the proposed operators in AlphaGo and
AlphaZero on problems of increasing dimensionality and branching factor, from a
toy problem to several Atari games, showing their superiority w.r.t.
representative baselines.
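To make the idea of a regularized backup concrete, here is a minimal sketch assuming a Shannon-entropy regularizer, for which the regularized maximum of the action values is the temperature-scaled log-sum-exp and the maximizing policy is a Boltzmann distribution. The function names, the temperature, and the toy values are illustrative assumptions; they are not the paper's exact operators, which also cover relative-entropy and Tsallis-entropy regularization.

```python
# Minimal sketch of an entropy-regularized backup at a single MCTS node
# (illustrative only; not the paper's exact backup or selection operators).
import numpy as np

def softmax_backup(q_values: np.ndarray, tau: float = 0.1) -> float:
    """Shannon-entropy-regularized value: tau * log(sum_a exp(Q(s, a) / tau))."""
    q_max = np.max(q_values)                 # subtract the max for numerical stability
    return q_max + tau * np.log(np.sum(np.exp((q_values - q_max) / tau)))

def boltzmann_policy(q_values: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Policy attaining the regularized value above (softmax over Q / tau)."""
    z = np.exp((q_values - np.max(q_values)) / tau)
    return z / z.sum()

if __name__ == "__main__":
    q = np.array([0.1, 0.5, 0.4])            # toy action-value estimates at one node
    print(softmax_backup(q), boltzmann_policy(q))
```

As the temperature tau goes to zero the backup recovers the usual hard maximum of UCT-style value propagation, while larger temperatures spread probability (and hence visits) over more actions, which is how the regularizer drives exploration.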
Related papers
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability.
arXiv Detail & Related papers (2024-10-03T18:12:29Z)
- Learning Merton's Strategies in an Incomplete Market: Recursive Entropy Regularization and Biased Gaussian Exploration [11.774563966512709]
We take the reinforcement learning (RL) approach to learn optimal portfolio policies directly by exploring the unknown market.
We present an analysis of the resulting errors to show how the level of exploration affects the learned policies.
arXiv Detail & Related papers (2023-12-19T02:14:13Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution using Langevin Monte Carlo (a toy sketch of this update appears after this list).
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Benchmarking Constraint Inference in Inverse Reinforcement Learning [19.314352936252444]
In many real-world problems, the constraints followed by expert agents are hard to specify mathematically and are unknown to the RL agents.
In this paper, we construct a CIRL benchmark in the context of two major application domains: robot control and autonomous driving.
The benchmark, including the information for reproducing the performance of CIRL algorithms, is publicly available at https://github.com/Guiliang/CIRL-benchmarks-public.
arXiv Detail & Related papers (2022-06-20T09:22:20Z)
- Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes [3.9311044240639568]
Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for expected return using gradient ascent.
While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues.
In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL. We then explore a combined policy approach of PG and MCTL to leverage their strengths.
arXiv Detail & Related papers (2022-06-02T12:21:40Z)
- A Unified Perspective on Value Backup and Exploration in Monte-Carlo Tree Search [41.11958980731047]
We propose two methods for improving the convergence rate and exploration based on a newly introduced backup operator and entropy regularization.
We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework.
In practice, our unified perspective offers a flexible way to balance between exploration and exploitation by tuning the single $\alpha$ parameter according to the problem at hand.
arXiv Detail & Related papers (2022-02-11T15:30:08Z)
- Collective eXplainable AI: Explaining Cooperative Strategies and Agent Contribution in Multiagent Reinforcement Learning with Shapley Values [68.8204255655161]
This study proposes a novel approach to explain cooperative strategies in multiagent RL using Shapley values.
Results could have implications for non-discriminatory decision making, ethical and responsible AI-derived decisions or policy making under fairness constraints.
arXiv Detail & Related papers (2021-10-04T10:28:57Z)
- Bayesian decision-making under misspecified priors with applications to meta-learning [64.38020203019013]
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
arXiv Detail & Related papers (2021-07-03T23:17:26Z)
- Bayesian Bellman Operators [55.959376449737405]
We introduce a novel perspective on Bayesian reinforcement learning (RL).
Our framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions.
arXiv Detail & Related papers (2021-06-09T12:20:46Z)
- Monte-Carlo Tree Search as Regularized Policy Optimization [47.541849128047865]
We show that AlphaZero's search algorithm is an approximation to the solution of a specific regularized policy optimization problem (see the sketch after this list).
We propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.
arXiv Detail & Related papers (2020-07-24T13:01:34Z)
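The last entry above ("Monte-Carlo Tree Search as Regularized Policy Optimization") describes AlphaZero's search as approximating the solution of a regularized policy optimization problem. The following is a minimal sketch under an assumed formulation: maximize the expected Q-value minus a reverse-KL penalty toward a prior (network) policy over the probability simplex. The variable names, the penalty weight lam, and the bisection tolerance are illustrative, not taken from the paper.

```python
# Sketch: solve  max_{pi in simplex}  <q, pi> - lam * KL(prior || pi).
# The stationarity condition gives  pi(a) = lam * prior(a) / (mu - q(a)),
# with the scalar multiplier mu fixed by bisection so that pi sums to one.
import numpy as np

def regularized_policy(q: np.ndarray, prior: np.ndarray, lam: float,
                       iters: int = 100) -> np.ndarray:
    lo, hi = np.max(q) + 1e-9, np.max(q) + lam   # sum(pi) >= 1 at lo and <= 1 at hi
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if (lam * prior / (mu - q)).sum() > 1.0:
            lo = mu                               # sum too large: increase mu
        else:
            hi = mu                               # sum too small: decrease mu
    pi = lam * prior / (0.5 * (lo + hi) - q)
    return pi / pi.sum()                          # clean up residual bisection error

if __name__ == "__main__":
    q = np.array([0.2, 0.8, 0.5])                 # toy Q-value estimates
    prior = np.array([0.3, 0.4, 0.3])             # toy prior policy from a network
    print(regularized_policy(q, prior, lam=0.5))
```

As lam grows the solution approaches the prior policy, and as lam shrinks it concentrates on the highest-valued action.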
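For the Langevin Monte Carlo entry above ("Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo"), the toy sketch below illustrates the general mechanism under assumed details: noisy gradient steps on a squared regression loss for a linear Q-function yield an approximate posterior sample of the weights, which can then drive Thompson-sampling-style action selection. The loss, features, step size, and step count are all illustrative assumptions, not the paper's algorithm.

```python
# Toy Langevin dynamics: theta <- theta - step * grad(U) + sqrt(2 * step) * noise,
# where U is a squared data-fit loss plus a Gaussian prior term for linear Q-weights.
import numpy as np

rng = np.random.default_rng(0)

def langevin_sample(theta, features, targets, step=1e-2, n_steps=200):
    """Return one approximate posterior sample of linear Q-function weights."""
    for _ in range(n_steps):
        residual = features @ theta - targets
        grad = features.T @ residual + theta      # data-fit gradient + prior gradient
        theta = theta - step * grad + np.sqrt(2.0 * step) * rng.normal(size=theta.shape)
    return theta

if __name__ == "__main__":
    phi = rng.normal(size=(32, 4))                # toy state-action features
    w_true = np.array([1.0, -0.5, 0.2, 0.0])
    y = phi @ w_true + 0.1 * rng.normal(size=32)  # toy regression targets (e.g. TD targets)
    print(langevin_sample(np.zeros(4), phi, y))   # acting greedily w.r.t. this sampled
                                                  # Q gives Thompson-sampling-like exploration
```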