Monte-Carlo Tree Search as Regularized Policy Optimization
- URL: http://arxiv.org/abs/2007.12509v1
- Date: Fri, 24 Jul 2020 13:01:34 GMT
- Title: Monte-Carlo Tree Search as Regularized Policy Optimization
- Authors: Jean-Bastien Grill, Florent Altch\'e, Yunhao Tang, Thomas Hubert,
Michal Valko, Ioannis Antonoglou, R\'emi Munos
- Abstract summary: We show that AlphaZero's search algorithms are an approximation to the solution of a specific regularized policy optimization problem.
We propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.
- Score: 47.541849128047865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The combination of Monte-Carlo tree search (MCTS) with deep reinforcement
learning has led to significant advances in artificial intelligence. However,
AlphaZero, the current state-of-the-art MCTS algorithm, still relies on
handcrafted heuristics that are only partially understood. In this paper, we
show that AlphaZero's search heuristics, along with other common ones such as
UCT, are an approximation to the solution of a specific regularized policy
optimization problem. With this insight, we propose a variant of AlphaZero
which uses the exact solution to this policy optimization problem, and show
experimentally that it reliably outperforms the original algorithm in multiple
domains.
Related papers
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability.
arXiv Detail & Related papers (2024-10-03T18:12:29Z) - Socio-cognitive Optimization of Time-delay Control Problems using
Evolutionary Metaheuristics [89.24951036534168]
Metaheuristics are universal optimization algorithms which should be used for solving difficult problems, unsolvable by classic approaches.
In this paper we aim at constructing novel socio-cognitive metaheuristic based on castes, and apply several versions of this algorithm to optimization of time-delay system model.
arXiv Detail & Related papers (2022-10-23T22:21:10Z) - Exploring the Algorithm-Dependent Generalization of AUPRC Optimization
with List Stability [107.65337427333064]
optimization of the Area Under the Precision-Recall Curve (AUPRC) is a crucial problem for machine learning.
In this work, we present the first trial in the single-dependent generalization of AUPRC optimization.
Experiments on three image retrieval datasets on speak to the effectiveness and soundness of our framework.
arXiv Detail & Related papers (2022-09-27T09:06:37Z) - CITS: Coherent Ising Tree Search Algorithm Towards Solving Combinatorial
Optimization Problems [0.0]
This paper proposes a search algorithm by expanding search space from a Markov chain to a depth limited tree based on SA.
At each iteration, the algorithm will select the best near-optimal solution within the feasible search space by exploring along the tree in the sense of look ahead'
Our results show that above the primals SA and CIM, our high-level tree search strategy is able to provide solutions within fewer epochs for Ising formulated NP-optimization problems.
arXiv Detail & Related papers (2022-03-09T10:07:26Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - Monte-Carlo Graph Search for AlphaZero [15.567057178736402]
We introduce a new, improved search algorithm for AlphaZero that generalizes the search tree to a directed acyclic graph.
In our evaluations, we use the CrazyAra engine on chess and crazyhouse as examples to show that these changes bring significant improvements to AlphaZero.
arXiv Detail & Related papers (2020-12-20T22:51:38Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for
Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z) - Exploring search space trees using an adapted version of Monte Carlo
tree search for combinatorial optimization problems [0.6882042556551609]
This approach makes use of a algorithm to explore the search space tree of a problem instance.
The algorithm is based on Monte Carlo tree search, a popular algorithm in game playing that is used to explore game trees.
arXiv Detail & Related papers (2020-10-22T08:33:58Z) - On the Convergence of Reinforcement Learning with Monte Carlo Exploring
Starts [5.137144629366217]
A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring States (MCES) method.
We investigate the convergence of this algorithm for the case with undiscounted costs, also known as the shortest path problem.
As a side result, we also provide a proof of a version of the supermartingale convergence theorem commonly used in approximation.
arXiv Detail & Related papers (2020-07-21T16:19:09Z) - Convex Regularization in Monte-Carlo Tree Search [41.11958980731047]
We introduce a unifying theory on the use of generic convex regularizers in Monte-Carlo Tree Search (MCTS)
We exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update.
We empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor.
arXiv Detail & Related papers (2020-07-01T11:29:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.