Fugu-MT Paper Translation (Abstract): Where is the Grass Greener? Revisiting Generalized Policy Iteration for Offline Reinforcement Learning

Paper abstract: Where is the Grass Greener? Revisiting Generalized Policy Iteration for Offline Reinforcement Learning

arxiv url: http://arxiv.org/abs/2107.01407v1
Date: Sat, 3 Jul 2021 11:00:56 GMT
Status: Translation complete
Last updated in system: 2021-07-06 15:19:28.030894
Title: Where is the Grass Greener? Revisiting Generalized Policy Iteration for Offline Reinforcement Learning
Authors: Lionel Blondé, Alexandros Kalousis
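Abstract summary: We re-implement state-of-the-art baselines in the offline RL regime under a fair, unified, and highly factorized framework. We show that when a given baseline outperforms its competing counterparts on one end of the spectrum, it never does on the other end.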
Reference score (independently computed attention score): 81.15016852963676
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The performance of state-of-the-art baselines in the offline RL regime varies widely over the spectrum of dataset qualities, ranging from "far-from-optimal" random data to "close-to-optimal" expert demonstrations. We re-implement these under a fair, unified, and highly factorized framework, and show that when a given baseline outperforms its competing counterparts on one end of the spectrum, it never does on the other end. This consistent trend prevents us from naming a victor that outperforms the rest across the board. We attribute the asymmetry in performance between the two ends of the quality spectrum to the amount of inductive bias injected into the agent to entice it to posit that the behavior underlying the offline dataset is optimal for the task. The more bias is injected, the higher the agent performs, provided the dataset is close-to-optimal. Otherwise, its effect is brutally detrimental. Adopting an advantage-weighted regression template as base, we conduct an investigation which corroborates that injections of such optimality inductive bias, when not done parsimoniously, make the agent subpar on the datasets where it was dominant as soon as the offline policy is sub-optimal. In an effort to design methods that perform well across the whole spectrum, we revisit the generalized policy iteration scheme for the offline regime, and study the impact of nine distinct newly-introduced proposal distributions over actions, involved in the proposed generalization of the policy evaluation and policy improvement update rules. We show that certain orchestrations strike the right balance and can improve the performance on one end of the spectrum without harming it on the other end.
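The abstract names an advantage-weighted regression template as its base but gives no formulas. As a rough illustration only, the sketch below shows the weighting commonly associated with advantage-weighted regression: each dataset action's log-likelihood is scaled by exp(advantage / temperature). The function names, the `temperature` and `max_weight` parameters, and the clipping are assumptions made for this example, not details taken from the paper. With a very large temperature all weights approach 1, which reduces the loss to plain behavioral cloning, one possible reading of treating the dataset behavior as optimal.

```python
import math


def awr_weights(advantages, temperature=1.0, max_weight=20.0):
    """Return exp(A / temperature) for each advantage, clipped at max_weight.

    Hypothetical helper: the clipping and default values are illustrative
    assumptions, not settings from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    log_cap = math.log(max_weight)
    # Clip in log space so that large advantages cannot overflow exp().
    return [math.exp(min(a / temperature, log_cap)) for a in advantages]


def weighted_nll(log_probs, weights):
    """Mean weighted negative log-likelihood of the dataset actions."""
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


def behavioral_cloning_nll(log_probs):
    """Uniform weights: every dataset action is imitated equally."""
    return weighted_nll(log_probs, [1.0] * len(log_probs))
```

In this sketch, lowering `temperature` concentrates the loss on actions with high estimated advantage, while raising it moves the objective toward imitating the dataset as a whole.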
