Fugu-MT 論文翻訳(概要): Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

論文の概要: Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

arxiv url: http://arxiv.org/abs/2005.00527v2
Date: Thu, 9 Jul 2020 16:20:06 GMT
ステータス: 翻訳完了
システム内更新日: 2022-12-07 22:33:27.608409
Title: Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?
Title（参考訳）: 長軸強化学習は短軸強化学習よりも難易度が高いか?
Authors: Ruosong Wang, Simon S. Du, Lin F. Yang, Sham M. Kakade
Abstract要約: 長い地平線を計画する学習は、エピソード強化学習問題における中心的な課題である。長地平線RLは、少なくともミニマックス感覚において、短地平線RLよりも困難ではないことを示す。
参考スコア（独自算出の注目度）: 108.94173231481355
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is $\varepsilon$ near to that of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this results shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.
Abstract（参考訳）: 長地平線計画への学習は、エピソディクス強化学習問題の中心的な課題である。根本的な問題は、地平線が増すにつれて問題の難易度がいかに大きくなるかを理解することである。ここでサンプル複雑性の自然な尺度は正規化されたものである:我々は、各エピソードの正規化累積報酬によって値が測定される最適値の値に近い$\varepsilon$のポリシーを立証するために要するエピソードの数に興味を持っている。 COLT 2018のオープンな問題において、JiangとAgarwalは、表層的、エピソジックな強化学習問題に対して、水平線に多項式依存を示すサンプル複雑性の下界が存在すると推測した。この研究はこの予想を反論し、計画の地平線と対数的にしかスケールしないサンプル複雑性によって、表象的、エピソディックな強化学習が可能であることを証明している。言い換えれば、値が適切に正規化されているとき(単位区間に置かれる)、この結果は少なくともミニマックス感覚において、長い地平線 RL が短地平線 RL よりも困難でないことを示す。分析では2つのアイデアを紹介します (i)ログ被覆数が計画地平線と対数的にしかスケールしない最適政策のための$\varepsilon$-netの構築 (II) オンライン軌道合成アルゴリズムは, ある政策クラスのログ検索数とスケールするサンプル複雑性を用いて, ある政策クラスの全ての政策を適応的に評価する。どちらも独立した関心事である。

論文の概要: Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

関連論文リスト