Mastering the Game of Go with Self-play Experience Replay
- URL: http://arxiv.org/abs/2601.03306v1
- Date: Tue, 06 Jan 2026 08:42:40 GMT
- Title: Mastering the Game of Go with Self-play Experience Replay
- Authors: Jingbin Liu, Xuechun Wang
- Abstract summary: We present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Starting tabula rasa without human data and trained for 5 months with modest compute resources, QZero achieved a performance level comparable to that of AlphaGo.
- Score: 5.792200378727493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The game of Go has long served as a benchmark for artificial intelligence, demanding sophisticated strategic reasoning and long-term planning. Previous approaches, such as AlphaGo and its successors, have predominantly relied on model-based Monte-Carlo Tree Search (MCTS). In this work, we present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Built upon entropy-regularized Q-learning, QZero utilizes a single Q-value network to unify policy evaluation and improvement. Starting tabula rasa without human data and trained for 5 months with modest compute resources (7 GPUs), QZero achieved a performance level comparable to that of AlphaGo. This demonstrates, for the first time, the efficiency of using model-free reinforcement learning to master the game of Go, as well as the feasibility of off-policy reinforcement learning in solving large-scale and complex environments.
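The abstract does not give implementation details, but the entropy-regularized Q-learning family it builds on has a standard core: a single Q function induces both a softmax (Boltzmann) policy and a soft state value, unifying policy evaluation and improvement. The sketch below illustrates that general idea only; the function names and temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def soft_value(q, tau):
    # Soft (entropy-regularized) state value:
    #   V(s) = tau * log sum_a exp(Q(s, a) / tau)
    # computed with a max-shift for numerical stability.
    m = np.max(q)
    return m + tau * np.log(np.sum(np.exp((q - m) / tau)))

def soft_policy(q, tau):
    # Boltzmann policy induced by the same Q values:
    #   pi(a | s) proportional to exp(Q(s, a) / tau)
    z = (q - np.max(q)) / tau
    p = np.exp(z)
    return p / p.sum()

# Example: Q values for three actions at one state.
q = np.array([1.0, 2.0, 0.5])
tau = 1.0
pi = soft_policy(q, tau)   # a proper probability distribution
v = soft_value(q, tau)     # upper-bounds the greedy value max_a Q(s, a)
```

As the temperature `tau` shrinks, the policy approaches the greedy argmax and the soft value approaches the ordinary max, which is why one network can serve both roles.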
Related papers
- R-Zero: Self-Evolving Reasoning LLM from Zero Data [47.8125954446991]
Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. Existing methods for training such models still rely heavily on vast human-curated tasks and labels. We introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.
arXiv Detail & Related papers (2025-08-07T03:38:16Z)
- Reinforcement Learning in Strategy-Based and Atari Games: A Review of Google DeepMinds Innovations [0.0]
Reinforcement Learning (RL) has been widely used in many applications, particularly in gaming. Google DeepMind has pioneered innovations in this field, employing reinforcement learning algorithms to create advanced AI models. This paper reviews the significance of reinforcement learning applications in Atari and strategy-based games.
arXiv Detail & Related papers (2025-02-14T17:06:34Z)
- Reinforcing Competitive Multi-Agents for Playing 'So Long Sucker' [0.12234742322758417]
This paper investigates the strategy game So Long Sucker (SLS) as a novel benchmark for multi-agent reinforcement learning (MARL). We introduce the first publicly available computational framework for SLS, complete with a graphical user interface and benchmarking support for reinforcement learning algorithms.
arXiv Detail & Related papers (2024-11-17T12:38:13Z)
- Improve Value Estimation of Q Function and Reshape Reward with Monte Carlo Tree Search [0.4450107621124637]
Reinforcement learning has achieved remarkable success in perfect information games such as Go and Atari.
Research in reinforcement learning for imperfect information games has been relatively limited due to the more complex game structures and randomness.
In this paper, we focus on Uno, an imperfect information game, and aim to address these problems by reducing Q-value overestimation and reshaping the reward function.
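The summary above does not say how overestimation is reduced; a widely used remedy in the Q-learning literature is the Double Q-learning target, which decouples action selection from action evaluation. The sketch below illustrates that standard technique, not this paper's specific method; all names and numbers are illustrative.

```python
import numpy as np

def double_q_target(q_online, q_target, reward, gamma, done):
    # Vanilla Q-learning overestimates because the same estimator
    # both selects and evaluates the argmax action. Double Q-learning
    # decouples the two:
    #   a* = argmax_a Q_online(s', a)
    #   y  = r + gamma * Q_target(s', a*)   (no bootstrap if terminal)
    a_star = int(np.argmax(q_online))
    bootstrap = 0.0 if done else gamma * q_target[a_star]
    return reward + bootstrap

# Toy next-state values: the online net overestimates action 1,
# while the target net evaluates it more conservatively.
q_online = np.array([0.2, 1.5, 0.3])
q_target = np.array([1.2, 0.9, 0.5])
y = double_q_target(q_online, q_target, reward=1.0, gamma=0.99, done=False)
vanilla = 1.0 + 0.99 * q_target.max()  # single-estimator target
```

Because the expectation of a max overestimates the max of expectations, the decoupled target is, on average, less biased than the vanilla one.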
arXiv Detail & Related papers (2024-10-15T14:31:54Z)
- Learning Answer Generation using Supervision from Automatic Question Answering Evaluators [98.9267570170737]
We propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA).
We evaluate our proposed methods on two academic and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.
arXiv Detail & Related papers (2023-05-24T16:57:04Z)
- Targeted Search Control in AlphaZero for Effective Policy Improvement [93.30151539224144]
We introduce Go-Exploit, a novel search control strategy for AlphaZero.
Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest.
Go-Exploit learns with a greater sample efficiency than standard AlphaZero.
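The mechanism described above, starting self-play trajectories from an archive of previously seen states rather than always from the initial position, can be sketched as follows. This is a minimal illustration of the stated idea; the class, capacity, and the probability of still starting from the initial state are all illustrative assumptions, not details from the paper.

```python
import random

class StateArchive:
    # Keep an archive of states of interest seen during self-play and
    # sample the start state of new trajectories from it, occasionally
    # falling back to the true initial state.
    def __init__(self, initial_state, capacity=1000, p_initial=0.2):
        self.initial_state = initial_state
        self.capacity = capacity
        self.p_initial = p_initial
        self.states = []

    def add(self, state):
        if len(self.states) >= self.capacity:
            # Overwrite a random slot once full, so old and new
            # states both remain represented.
            self.states[random.randrange(self.capacity)] = state
        else:
            self.states.append(state)

    def sample_start(self):
        if not self.states or random.random() < self.p_initial:
            return self.initial_state
        return random.choice(self.states)

archive = StateArchive(initial_state="empty_board")
for s in ["s1", "s2", "s3"]:
    archive.add(s)
start = archive.sample_start()  # "empty_board" or an archived state
```

Restarting from interesting mid-game states focuses learning on positions the agent actually needs to master, which is the sample-efficiency gain the summary refers to.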
arXiv Detail & Related papers (2023-02-23T22:50:24Z)
- Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes [100.69714600180895]
Offline Q-learning algorithms exhibit strong performance that scales with model capacity.
We train a single policy on 40 games with near-human performance using up to 80-million-parameter networks.
Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal.
arXiv Detail & Related papers (2022-11-28T08:56:42Z)
- Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and less running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z)
- On the Theory of Reinforcement Learning with Once-per-Episode Feedback [120.5537226120512]
We introduce a theory of reinforcement learning in which the learner receives feedback only once at the end of an episode.
This is arguably more representative of real-world applications than the traditional requirement that the learner receive feedback at every time step.
arXiv Detail & Related papers (2021-05-29T19:48:51Z)
- Combining Off and On-Policy Training in Model-Based Reinforcement Learning [77.34726150561087]
We propose a way to obtain off-policy targets using data from simulated games in MuZero.
Our results show that these targets speed up the training process and lead to faster convergence and higher rewards.
arXiv Detail & Related papers (2021-02-24T10:47:26Z)
- Chrome Dino Run using Reinforcement Learning [0.0]
We study the most popular model-free reinforcement learning algorithms, combined with a convolutional neural network, to train an agent to play the game of Chrome Dino Run.
We use two popular temporal-difference approaches, namely Deep Q-Learning and Expected SARSA, and also implement a Double DQN model to train the agent.
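The two temporal-difference targets named above differ only in how they bootstrap from the next state: Q-learning uses the greedy action's value, while Expected SARSA averages over the behavior policy. A minimal sketch of that textbook distinction (the epsilon-greedy policy and all values here are illustrative, not from the paper):

```python
import numpy as np

def q_learning_target(q_next, reward, gamma):
    # Q-learning bootstraps from the greedy action's value.
    return reward + gamma * np.max(q_next)

def expected_sarsa_target(q_next, policy_next, reward, gamma):
    # Expected SARSA bootstraps from the expectation of Q under the
    # behavior policy, lowering the variance of the target.
    return reward + gamma * np.dot(policy_next, q_next)

# Toy example with an epsilon-greedy policy over three actions.
q_next = np.array([0.1, 0.8, 0.3])
eps = 0.1
policy = np.full(3, eps / 3)
policy[np.argmax(q_next)] += 1.0 - eps
y_q = q_learning_target(q_next, reward=1.0, gamma=0.99)
y_es = expected_sarsa_target(q_next, policy, reward=1.0, gamma=0.99)
# The expectation under any policy is <= the max, so y_es <= y_q.
```

With `eps = 0` the two targets coincide, which is why Expected SARSA is often described as a generalization of Q-learning.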
arXiv Detail & Related papers (2020-08-15T22:18:20Z)
- Warm-Start AlphaZero Self-Play Search Enhancements [5.096685900776467]
Recently, AlphaZero has achieved landmark results in deep reinforcement learning, but its self-play training is inefficient in the early phase, when the network is still untrained.
We propose a novel approach to deal with this cold-start problem by employing simple search enhancements.
Our experiments indicate that most of these enhancements improve the performance of their baseline player in three different (small) board games.
arXiv Detail & Related papers (2020-04-26T11:48:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.