Combining Off and On-Policy Training in Model-Based Reinforcement
Learning
- URL: http://arxiv.org/abs/2102.12194v1
- Date: Wed, 24 Feb 2021 10:47:26 GMT
- Title: Combining Off and On-Policy Training in Model-Based Reinforcement
Learning
- Authors: Alexandre Borges and Arlindo Oliveira
- Abstract summary: We propose a way to obtain off-policy targets using data from simulated games in MuZero.
Our results show that these targets speed up the training process and lead to faster convergence and higher rewards.
- Score: 77.34726150561087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The combination of deep learning and Monte Carlo Tree Search (MCTS) has been shown
to be effective in various domains, such as board and video games. AlphaGo
represented a significant step forward in our ability to learn complex board
games, and it was rapidly followed by significant advances, such as AlphaGo
Zero and AlphaZero. Recently, MuZero demonstrated that it is possible to master
both Atari games and board games by directly learning a model of the
environment, which is then used with MCTS to decide what move to play in each
position. During tree search, the algorithm simulates games by exploring
several possible moves and then picks the action that corresponds to the most
promising trajectory. During training, however, little use is made of these simulated
games, since none of their trajectories are directly used as training examples.
Even if we consider that not all trajectories from simulated games are useful,
there are thousands of potentially useful trajectories that are discarded.
Using information from these trajectories would provide more training data,
more quickly, leading to faster convergence and higher sample efficiency.
Recent work introduced an off-policy value target for AlphaZero that uses data
from simulated games. In this work, we propose a way to obtain off-policy
targets using data from simulated games in MuZero. We combine these off-policy
targets with the on-policy targets already used in MuZero in several ways, and
study the impact of these targets and their combinations in three environments
with distinct characteristics. Our results show that, when used in the right
combinations, these targets speed up the training process and lead to faster
convergence and higher rewards than those obtained by MuZero.
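The abstract leaves the construction of the off-policy targets implicit. Below is a minimal sketch of one plausible construction, assuming an n-step discounted return computed along a trajectory simulated inside the search tree, blended with the usual on-policy target; the function names, the mixing weight `beta`, and the discount value are illustrative assumptions rather than the paper's actual scheme.

```python
def n_step_return(rewards, bootstrap_value, discount=0.997):
    """Discounted n-step return: r_1 + g*r_2 + ... + g^(n-1)*r_n + g^n * bootstrap."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + discount * target
    return target

def mixed_value_target(on_policy_target, simulated_rewards, leaf_value,
                       beta=0.5, discount=0.997):
    """Blend the usual on-policy value target with an off-policy target built
    from a trajectory explored during MCTS (hypothetical weighting scheme)."""
    off_policy_target = n_step_return(simulated_rewards, leaf_value, discount)
    return beta * on_policy_target + (1.0 - beta) * off_policy_target

# Example: rewards predicted by the model along one simulated line of play,
# plus the value estimate at the leaf node where the simulation stopped.
target = mixed_value_target(on_policy_target=0.8,
                            simulated_rewards=[0.0, 0.0, 1.0],
                            leaf_value=0.6)
```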
Related papers
- Interpreting the Learned Model in MuZero Planning [12.47846647115319]
MuZero has achieved superhuman performance in various games by using a dynamics network to predict environment dynamics for planning.
This paper aims to demystify MuZero's model by interpreting the learned latent states.
arXiv Detail & Related papers (2024-11-07T10:06:23Z)
- Autonomous Vehicle Controllers From End-to-End Differentiable Simulation [60.05963742334746]
We propose a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers.
Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of environment dynamics serve as a useful prior to help the agent learn a more grounded policy.
We find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
arXiv Detail & Related papers (2024-09-12T11:50:06Z)
- MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games [9.339645051415115]
MiniZero is a zero-knowledge learning framework that supports four state-of-the-art algorithms.
We evaluate the performance of each algorithm in two board games, 9x9 Go and 8x8 Othello, as well as 57 Atari games.
arXiv Detail & Related papers (2023-10-17T14:29:25Z)
- Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning [65.36326734799587]
We present a novel subgame curriculum learning framework for zero-sum games.
It adopts an adaptive initial state distribution by resetting agents to some previously visited states.
We derive a subgame selection metric that approximates the squared distance to NE values.
arXiv Detail & Related papers (2023-10-07T13:09:37Z)
- Targeted Search Control in AlphaZero for Effective Policy Improvement [93.30151539224144]
We introduce Go-Exploit, a novel search control strategy for AlphaZero.
Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest.
Go-Exploit learns with a greater sample efficiency than standard AlphaZero.
arXiv Detail & Related papers (2023-02-23T22:50:24Z)
- Efficient Offline Policy Optimization with a Learned Model [83.64779942889916]
MuZero Unplugged presents a promising approach for offline policy learning from logged data.
It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data.
This paper investigates a few hypotheses for why MuZero Unplugged may not work well in offline settings.
arXiv Detail & Related papers (2022-10-12T07:41:04Z)
- Complex Momentum for Learning in Games [42.081050296353574]
We generalize gradient descent with momentum for learning in differentiable games to have complex-valued momentum.
We empirically demonstrate that complex-valued momentum can improve convergence in games such as generative adversarial networks (a minimal sketch of the update rule appears after this list).
We also show a practical generalization to a complex-valued Adam variant, which we use to train BigGAN to better scores on CIFAR-10.
arXiv Detail & Related papers (2021-02-16T19:55:27Z)
- Model-Based Reinforcement Learning for Atari [89.3039240303797]
We show how video prediction models can enable agents to solve Atari games with fewer interactions than model-free methods.
Our experiments evaluate SimPLe on a range of Atari games in a low-data regime of 100k interactions between the agent and the environment.
arXiv Detail & Related papers (2019-03-01T15:40:19Z)
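For the complex-momentum entry above, the core mechanism can be summarized in a few lines. The sketch below is an assumption-laden reading of the idea rather than that paper's implementation: it keeps a complex-valued momentum buffer and applies only its real part to the real-valued parameters; the coefficient value and the function name are illustrative.

```python
import numpy as np

def complex_momentum_step(params, grad, buffer, lr=0.01, beta=0.8 + 0.3j):
    """One gradient step with a complex momentum coefficient (illustrative values).
    The buffer is complex, but only its real part moves the real-valued parameters."""
    buffer = beta * buffer - grad            # complex-valued momentum accumulation
    params = params + lr * np.real(buffer)   # apply only the real component
    return params, buffer

# Toy usage on the scalar objective f(x) = x^2 (gradient 2x), starting from x = 1.0.
x, m = 1.0, 0.0 + 0.0j
for _ in range(50):
    x, m = complex_momentum_step(x, 2.0 * x, m)
```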