Combining Off and On-Policy Training in Model-Based Reinforcement
Learning
- URL: http://arxiv.org/abs/2102.12194v1
- Date: Wed, 24 Feb 2021 10:47:26 GMT
- Title: Combining Off and On-Policy Training in Model-Based Reinforcement
Learning
- Authors: Alexandre Borges and Arlindo Oliveira
- Abstract summary: We propose a way to obtain off-policy targets using data from simulated games in MuZero.
Our results show that these targets speed up the training process and lead to faster convergence and higher rewards.
- Score: 77.34726150561087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The combination of deep learning and Monte Carlo Tree Search (MCTS) has been shown
to be effective in various domains, such as board and video games. AlphaGo
represented a significant step forward in our ability to learn complex board
games, and it was rapidly followed by significant advances, such as AlphaGo
Zero and AlphaZero. Recently, MuZero demonstrated that it is possible to master
both Atari games and board games by directly learning a model of the
environment, which is then used with MCTS to decide what move to play in each
position. During tree search, the algorithm simulates games by exploring
several possible moves and then picks the action that corresponds to the most
promising trajectory. During training, however, little use is made of these simulated
games, since none of their trajectories are directly used as training examples.
Even if we consider that not all trajectories from simulated games are useful,
there are thousands of potentially useful trajectories that are discarded.
Using information from these trajectories would provide more training data,
more quickly, leading to faster convergence and higher sample efficiency.
Recent work introduced an off-policy value target for AlphaZero that uses data
from simulated games. In this work, we propose a way to obtain off-policy
targets using data from simulated games in MuZero. We combine these off-policy
targets with the on-policy targets already used in MuZero in several ways, and
study the impact of these targets and their combinations in three environments
with distinct characteristics. Our results show that, when used in the right
combinations, these targets speed up the training process and lead to faster
convergence and higher rewards than those obtained by MuZero.
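The abstract leaves the construction of the off-policy targets implicit. Below is a minimal sketch of one plausible construction, assuming an n-step discounted return computed along a trajectory simulated inside the search tree, blended with the usual on-policy target; the function names, the mixing weight `beta`, and the discount value are illustrative assumptions rather than the paper's actual scheme.

```python
def n_step_return(rewards, bootstrap_value, discount=0.997):
    """Discounted n-step return: r_1 + g*r_2 + ... + g^(n-1)*r_n + g^n * bootstrap."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + discount * target
    return target

def mixed_value_target(on_policy_target, simulated_rewards, leaf_value,
                       beta=0.5, discount=0.997):
    """Blend the usual on-policy value target with an off-policy target built
    from a trajectory explored during MCTS (hypothetical weighting scheme)."""
    off_policy_target = n_step_return(simulated_rewards, leaf_value, discount)
    return beta * on_policy_target + (1.0 - beta) * off_policy_target

# Example: rewards predicted by the model along one simulated line of play,
# plus the value estimate at the leaf node where the simulation stopped.
target = mixed_value_target(on_policy_target=0.8,
                            simulated_rewards=[0.0, 0.0, 1.0],
                            leaf_value=0.6)
```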
Related papers
- Interpreting the Learned Model in MuZero Planning [12.47846647115319]
MuZero has achieved superhuman performance in various games by using a dynamics network to predict environment dynamics for planning.
This paper aims to demystify MuZero's model by interpreting the learned latent states.
arXiv Detail & Related papers (2024-11-07T10:06:23Z)
- Autonomous Vehicle Controllers From End-to-End Differentiable Simulation [60.05963742334746]
We propose a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers.
Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of environment dynamics serve as a useful prior to help the agent learn a more grounded policy.
We find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
arXiv Detail & Related papers (2024-09-12T11:50:06Z)
- MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games [9.339645051415115]
MiniZero is a zero-knowledge learning framework that supports four state-of-the-art algorithms.
We evaluate the performance of each algorithm in two board games, 9x9 Go and 8x8 Othello, as well as 57 Atari games.
arXiv Detail & Related papers (2023-10-17T14:29:25Z)
- Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning [65.36326734799587]
We present a novel subgame curriculum learning framework for zero-sum games.
It adopts an adaptive initial state distribution by resetting agents to some previously visited states.
We derive a subgame selection metric that approximates the squared distance to NE values.
arXiv Detail & Related papers (2023-10-07T13:09:37Z)
- Targeted Search Control in AlphaZero for Effective Policy Improvement [93.30151539224144]
We introduce Go-Exploit, a novel search control strategy for AlphaZero.
Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest.
Go-Exploit learns with a greater sample efficiency than standard AlphaZero.
arXiv Detail & Related papers (2023-02-23T22:50:24Z)
- Efficient Offline Policy Optimization with a Learned Model [83.64779942889916]
MuZero Unplugged presents a promising approach for offline policy learning from logged data.
It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data.
This paper investigates a few hypotheses for why MuZero Unplugged may not work well in offline settings.
arXiv Detail & Related papers (2022-10-12T07:41:04Z)
- Complex Momentum for Learning in Games [42.081050296353574]
We generalize gradient descent with momentum for learning in differentiable games to have complex-valued momentum.
We empirically demonstrate that complex-valued momentum can improve convergence in games such as generative adversarial networks (a minimal sketch of the update rule appears after this list).
We also show a practical generalization to a complex-valued Adam variant, which we use to train BigGAN to better scores on CIFAR-10.
arXiv Detail & Related papers (2021-02-16T19:55:27Z)
- Model-Based Reinforcement Learning for Atari [89.3039240303797]
We show how video prediction models can enable agents to solve Atari games with fewer interactions than model-free methods.
Our experiments evaluate SimPLe on a range of Atari games in a low-data regime of 100k interactions between the agent and the environment.
arXiv Detail & Related papers (2019-03-01T15:40:19Z)
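For the complex-momentum entry above, the core mechanism can be summarized in a few lines. The sketch below is an assumption-laden reading of the idea rather than that paper's implementation: it keeps a complex-valued momentum buffer and applies only its real part to the real-valued parameters; the coefficient value and the function name are illustrative.

```python
import numpy as np

def complex_momentum_step(params, grad, buffer, lr=0.01, beta=0.8 + 0.3j):
    """One gradient step with a complex momentum coefficient (illustrative values).
    The buffer is complex, but only its real part moves the real-valued parameters."""
    buffer = beta * buffer - grad            # complex-valued momentum accumulation
    params = params + lr * np.real(buffer)   # apply only the real component
    return params, buffer

# Toy usage on the scalar objective f(x) = x^2 (gradient 2x), starting from x = 1.0.
x, m = 1.0, 0.0 + 0.0j
for _ in range(50):
    x, m = complex_momentum_step(x, 2.0 * x, m)
```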