Targeted Search Control in AlphaZero for Effective Policy Improvement
- URL: http://arxiv.org/abs/2302.12359v1
- Date: Thu, 23 Feb 2023 22:50:24 GMT
- Title: Targeted Search Control in AlphaZero for Effective Policy Improvement
- Authors: Alexandre Trudeau, Michael Bowling
- Abstract summary: We introduce Go-Exploit, a novel search control strategy for AlphaZero.
Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest.
Go-Exploit learns with a greater sample efficiency than standard AlphaZero.
- Score: 93.30151539224144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AlphaZero is a self-play reinforcement learning algorithm that achieves
superhuman play in chess, shogi, and Go via policy iteration. To be an
effective policy improvement operator, AlphaZero's search requires accurate
value estimates for the states appearing in its search tree. AlphaZero trains
upon self-play matches beginning from the initial state of a game and only
samples actions over the first few moves, limiting its exploration of states
deeper in the game tree. We introduce Go-Exploit, a novel search control
strategy for AlphaZero. Go-Exploit samples the start state of its self-play
trajectories from an archive of states of interest. Beginning self-play
trajectories from varied starting states enables Go-Exploit to more effectively
explore the game tree and to learn a value function that generalizes better.
Producing shorter self-play trajectories allows Go-Exploit to train upon more
independent value targets, improving value training. Finally, the exploration
inherent in Go-Exploit reduces its need for exploratory actions, enabling it to
train under more exploitative policies. In the games of Connect Four and 9x9
Go, we show that Go-Exploit learns with a greater sample efficiency than
standard AlphaZero, resulting in stronger performance against reference
opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a
more sample efficient reimplementation of AlphaZero, and demonstrate that
Go-Exploit has a more effective search control strategy. Furthermore,
Go-Exploit's sample efficiency improves when KataGo's other innovations are
incorporated.
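The central mechanism described in the abstract is sampling self-play start states from an archive of previously visited states rather than always starting from the game's initial position. The following Python sketch illustrates that idea only at a high level; the archive contents, eviction rule, sampling probabilities, and all class and function names (StateArchive, run_mcts, the game interface) are hypothetical and are not the paper's exact procedure.

```python
import random

class StateArchive:
    """Bounded archive of 'states of interest' seen during self-play.
    Hypothetical sketch; the paper defines what qualifies as a state
    of interest and how the archive is maintained."""

    def __init__(self, capacity=10_000, p_initial=0.2):
        self.capacity = capacity
        self.p_initial = p_initial  # chance of starting from the game's initial state
        self.states = []

    def add(self, state):
        # Keep the archive bounded by evicting a random old entry.
        if len(self.states) >= self.capacity:
            self.states.pop(random.randrange(len(self.states)))
        self.states.append(state)

    def sample_start_state(self, initial_state):
        # Occasionally start from the true initial state so opening play
        # is still covered; otherwise resume from an archived state.
        if not self.states or random.random() < self.p_initial:
            return initial_state
        return random.choice(self.states)


def self_play_episode(game, archive, run_mcts):
    """One Go-Exploit-style self-play trajectory: begin from a sampled
    start state, play to the end, and archive states visited along the way."""
    state = archive.sample_start_state(game.initial_state())
    trajectory = []
    while not game.is_terminal(state):
        policy, _value = run_mcts(state)   # AlphaZero-style search at this state
        trajectory.append((state, policy))
        archive.add(state)                 # candidate "state of interest" (sketch)
        state = game.next_state(state, policy.sample())
    return trajectory, game.outcome(state)
```

Mixing in the initial state with some probability, as in the sketch, keeps opening positions in the training distribution while the archived starts push exploration deeper into the game tree and yield shorter, more independent trajectories.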
Related papers
- AlphaZero Gomoku [9.434566356382529]
We broaden the use of AlphaZero to Gomoku, an age-old tactical board game also referred to as "Five in a Row".
Our tests demonstrate AlphaZero's versatility in adapting to games other than Go.
arXiv Detail & Related papers (2023-09-04T00:20:06Z)
- Are AlphaZero-like Agents Robust to Adversarial Perturbations? [73.13944217915089]
AlphaZero (AZ) has demonstrated that neural-network-based Go AIs can surpass human performance by a large margin.
We ask whether adversarial states exist for Go AIs that may lead them to play surprisingly wrong actions.
We develop the first adversarial attack on Go AIs that can efficiently search for adversarial states by strategically reducing the search space.
arXiv Detail & Related papers (2022-11-07T18:43:25Z)
- AlphaZero-Inspired General Board Game Learning and Playing [0.0]
Recently, the seminal algorithms AlphaGo and AlphaZero have started a new era in game learning and deep reinforcement learning.
In this paper, we pick an important element of AlphaZero - the Monte Carlo Tree Search (MCTS) planning stage - and combine it with reinforcement learning (RL) agents.
We apply this new architecture to several complex games (Othello, ConnectFour, Rubik's Cube) and show the advantages achieved with this AlphaZero-inspired MCTS wrapper.
arXiv Detail & Related papers (2022-04-28T07:04:14Z)
- Adaptive Warm-Start MCTS in AlphaZero-like Deep Reinforcement Learning [5.55810668640617]
We propose a warm-start enhancement method for Monte Carlo Tree Search.
We show that our approach works better than the fixed $I'$, especially for "deep" tactical games.
We conclude that AlphaZero-like deep reinforcement learning benefits from adaptive rollout based warm-start.
arXiv Detail & Related papers (2021-05-13T08:24:51Z)
- Combining Off and On-Policy Training in Model-Based Reinforcement Learning [77.34726150561087]
We propose a way to obtain off-policy targets using data from simulated games in MuZero.
Our results show that these targets speed up the training process and lead to faster convergence and higher rewards.
arXiv Detail & Related papers (2021-02-24T10:47:26Z)
- Efficient exploration of zero-sum stochastic games [83.28949556413717]
We investigate the increasingly important and common game-solving setting where we do not have an explicit description of the game but only oracle access to it through gameplay.
During a limited-duration learning phase, the algorithm can control the actions of both players in order to try to learn the game and how to play it well.
Our motivation is to quickly learn strategies that have low exploitability in situations where evaluating the payoffs of a queried strategy profile is costly.
arXiv Detail & Related papers (2020-02-24T20:30:38Z)
- Never Give Up: Learning Directed Exploration Strategies [63.19616370038824]
We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies.
We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies.
A self-supervised inverse dynamics model is used to train the embeddings of the nearest-neighbour lookup, biasing the novelty signal towards what the agent can control (a rough sketch of this episodic bonus appears after this list).
arXiv Detail & Related papers (2020-02-14T13:57:22Z)
- Provable Self-Play Algorithms for Competitive Reinforcement Learning [48.12602400021397]
We study self-play in competitive reinforcement learning under the setting of Markov games.
We show that a self-play algorithm achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game.
We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret of $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case.
arXiv Detail & Related papers (2020-02-10T18:44:50Z)
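As referenced in the Never Give Up entry above, that paper's directed exploration relies on an episodic intrinsic reward computed with k-nearest neighbours over embeddings of the current episode. The sketch below is a rough, hypothetical rendering of such a bonus under assumed constants and function names; it is not the paper's exact formulation, and the embedding is assumed to come from an inverse-dynamics model trained elsewhere.

```python
import numpy as np

def episodic_novelty_bonus(embedding, episode_memory, k=10, eps=1e-3, c=1e-3):
    """Rough sketch of an episodic k-NN novelty bonus in the spirit of
    Never Give Up. `embedding` is the controllable-state embedding of the
    current observation; `episode_memory` holds embeddings seen earlier in
    the same episode. Constants and names here are illustrative."""
    if len(episode_memory) == 0:
        return 1.0
    # Squared distances to the k nearest neighbours within this episode.
    dists = np.sum((np.asarray(episode_memory) - embedding) ** 2, axis=1)
    knn = np.sort(dists)[:k]
    knn = knn / (np.mean(knn) + 1e-8)       # normalise by the mean neighbour distance
    # Inverse-kernel similarity: near-duplicate states give a small bonus.
    similarity = np.sum(eps / (knn + eps))
    return 1.0 / np.sqrt(similarity + c)
```

States that are far from anything seen earlier in the episode receive a larger bonus, which is what biases the learned exploratory policies toward novel, controllable parts of the state space.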