Fugu-MT Paper Translation (Abstract): Combining Off and On-Policy Training in Model-Based Reinforcement Learning

Paper summary: Combining Off and On-Policy Training in Model-Based Reinforcement Learning

arxiv url: http://arxiv.org/abs/2102.12194v1
Date: Wed, 24 Feb 2021 10:47:26 GMT
Status: Translation complete
Last updated in system: 2021-02-26 00:14:00.554768
Title: Combining Off and On-Policy Training in Model-Based Reinforcement Learning
Title (reference translation): Combining Off-Policy and On-Policy Training in Model-Based Reinforcement Learning
Authors: Alexandre Borges and Arlindo Oliveira
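Abstract summary: We propose a way to obtain off-policy targets using data from simulated games in MuZero. The results suggest that these targets speed up the training process and lead to faster convergence and higher rewards.
Reference score (independently computed attention score): 77.34726150561087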
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The combination of deep learning and Monte Carlo Tree Search (MCTS) has been shown to be effective in various domains, such as board and video games. AlphaGo represented a significant step forward in our ability to learn complex board games, and it was rapidly followed by significant advances, such as AlphaGo Zero and AlphaZero. Recently, MuZero demonstrated that it is possible to master both Atari games and board games by directly learning a model of the environment, which is then used with MCTS to decide what move to play in each position. During tree search, the algorithm simulates games by exploring several possible moves and then picks the action that corresponds to the most promising trajectory. When training, limited use is made of these simulated games since none of their trajectories are directly used as training examples. Even if we consider that not all trajectories from simulated games are useful, there are thousands of potentially useful trajectories that are discarded. Using information from these trajectories would provide more training data, more quickly, leading to faster convergence and higher sample efficiency. Recent work introduced an off-policy value target for AlphaZero that uses data from simulated games. In this work, we propose a way to obtain off-policy targets using data from simulated games in MuZero. We combine these off-policy targets with the on-policy targets already used in MuZero in several ways, and study the impact of these targets and their combinations in three environments with distinct characteristics. When used in the right combinations, our results show that these targets speed up the training process and lead to faster convergence and higher rewards than the ones obtained by MuZero.
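Abstract (reference translation): The combination of deep learning and Monte Carlo Tree Search (MCTS) has been shown to be effective in various domains, such as board games and video games. AlphaGo marked a major advance in the ability to learn complex board games, and further major advances such as AlphaGo Zero and AlphaZero quickly followed. Recently, MuZero demonstrated that both Atari games and board games can be mastered by directly learning a model of the environment. During tree search, the algorithm simulates games by exploring several possible moves and selects the action corresponding to the most promising trajectory. During training, these simulated games see only limited use, because none of their trajectories is used directly as a training example. Even assuming that not all trajectories from simulated games are useful, thousands of potentially useful trajectories are discarded. Using information from these trajectories would provide more training data more quickly, leading to faster convergence and higher sample efficiency. Recent work introduced an off-policy value target for AlphaZero that uses data from simulated games. In this work, we propose a method for obtaining off-policy targets using data from simulated games in MuZero. We combine these off-policy targets with the on-policy targets already used in MuZero in several ways and study the impact of these targets and their combinations in three environments with distinct characteristics. When used in the right combinations, these targets are shown to speed up the training process, converging faster and yielding higher rewards than those obtained by MuZero.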
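Illustrative note: the abstract says off-policy targets from simulated games are combined with MuZero's on-policy targets "in several ways" but does not give the formulas. The Python sketch below is only a rough illustration under stated assumptions, not the authors' method. It computes a standard n-step bootstrapped return, then blends two such values with a convex weight. The function names, the `weight` parameter, and the convex mix itself are hypothetical and chosen for illustration.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float,
                  discount: float) -> float:
    """Discounted sum of the given rewards plus a discounted bootstrap value."""
    total = 0.0
    for i, reward in enumerate(rewards):
        total += (discount ** i) * reward
    return total + (discount ** len(rewards)) * bootstrap_value


def combine_targets(on_policy: float, off_policy: float, weight: float) -> float:
    """Hypothetical convex mix of two value targets (an assumption, not from the paper).

    weight = 0 keeps only the on-policy target; weight = 1 keeps only the
    off-policy target.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * on_policy + weight * off_policy


# On-policy target: rewards observed along the trajectory actually played.
on_policy_target = n_step_return([1.0, 1.0], bootstrap_value=0.0, discount=0.5)

# Off-policy target (assumed form): rewards predicted along a simulated
# trajectory taken from the search tree.
off_policy_target = n_step_return([1.0, 1.0], bootstrap_value=4.0, discount=0.5)

mixed_target = combine_targets(on_policy_target, off_policy_target, weight=0.25)
```

With `weight` set to 0 the sketch reduces to the purely on-policy target, which makes it easy to compare a mixed setting against the baseline.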

Related papers