Fugu-MT Paper Translation (Abstract): Reducing Planning Complexity of General Reinforcement Learning with Non-Markovian Abstractions

Paper overview: Reducing Planning Complexity of General Reinforcement Learning with Non-Markovian Abstractions

arxiv url: http://arxiv.org/abs/2112.13386v1
Date: Sun, 26 Dec 2021 14:26:41 GMT
Status: Translation complete
Last updated in system: 2021-12-29 04:02:45.204256
Title: Reducing Planning Complexity of General Reinforcement Learning with Non-Markovian Abstractions
Title (reference translation): Reducing the Planning Complexity of General Reinforcement Learning Using Non-Markovian Abstractions
Authors: Sultan J. Majeed and Marcus Hutter
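Abstract summary: In general, near-optimal policies in General Reinforcement Learning (GRL) are functions of the complete history. We propose a novel non-MDP abstraction that allows for a much better upper bound of $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot A \cdot 2^{A}\right)$. We show that this bound can be improved further to $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot \log^3 A \right)$.
Reference score (independently computed attention level): 21.574781022415365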
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The field of General Reinforcement Learning (GRL) formulates the problem of sequential decision-making from the ground up. The history of interaction constitutes a "ground" state of the system, which never repeats. On the one hand, this generality allows GRL to model almost every domain possible, e.g., bandits, MDPs, POMDPs, PSRs, and history-based environments. On the other hand, in general, the near-optimal policies in GRL are functions of the complete history, which hinders not only learning but also planning in GRL. The usual workaround for the planning part is to give the agent a Markovian abstraction of the underlying process, so that it can use any MDP planning algorithm to find a near-optimal policy. The Extreme State Aggregation (ESA) framework has extended this idea to non-Markovian abstractions without compromising on the possibility of planning through a (surrogate) MDP. A distinguishing feature of ESA is that it proves an upper bound of $O\left(\varepsilon^{-A} \cdot (1-\gamma)^{-2A}\right)$ on the number of states required for the surrogate MDP (where $A$ is the number of actions, $\gamma$ is the discount factor, and $\varepsilon$ is the optimality gap), which holds uniformly for all domains. While the possibility of a universal bound is quite remarkable, we show that this bound is very loose. We propose a novel non-MDP abstraction which allows for a much better upper bound of $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot A \cdot 2^{A}\right)$. Furthermore, we show that this bound can be improved further to $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot \log^3 A \right)$ by using an action-sequentialization method.
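Abstract (reference translation): The field of General Reinforcement Learning (GRL) formulates the problem of sequential decision-making from the ground up. The history of interaction constitutes the "ground" state of the system, which never repeats. On the one hand, this generality allows GRL to model almost every domain, such as bandits, MDPs, POMDPs, PSRs, and history-based environments. On the other hand, near-optimal policies in GRL are in general functions of the complete history, which hinders not only learning but also planning in GRL. The usual approach to the planning part is that the agent is given a Markovian abstraction of the underlying process, so any MDP planning algorithm can be used to find a near-optimal policy. The Extreme State Aggregation (ESA) framework has extended this idea to non-Markovian abstractions. A distinguishing feature of ESA is that it proves an upper bound of $O\left(\varepsilon^{-A} \cdot (1-\gamma)^{-2A}\right)$ for the surrogate MDP (where $A$ is the number of actions, $\gamma$ is the discount factor, and $\varepsilon$ is the optimality gap), which holds uniformly for all domains. While the possibility of a universal bound is quite remarkable, we show that this bound is very loose. We propose a novel non-MDP abstraction that admits a much better upper bound of $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot A \cdot 2^{A}\right)$. Furthermore, we show that this bound can be improved further to $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot \log^3 A \right)$ by using an action-sequentialization method.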
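
To get a feel for how differently the three bounds quoted in the abstract grow, the following minimal sketch evaluates each expression for a few action counts. It is only an illustration of growth rates: big-O notation hides constant factors, so the hidden constant is assumed to be 1 here, the logarithm is assumed to be base 2 (the abstract does not state a base), and the function names are hypothetical and not taken from the paper.

```python
import math


def esa_bound(eps: float, gamma: float, num_actions: int) -> float:
    """ESA expression: eps^-A * (1 - gamma)^-2A (hidden constant assumed 1)."""
    return eps ** (-num_actions) * (1 - gamma) ** (-2 * num_actions)


def non_mdp_bound(eps: float, gamma: float, num_actions: int) -> float:
    """Non-MDP abstraction expression: eps^-1 * (1 - gamma)^-2 * A * 2^A."""
    return eps ** -1 * (1 - gamma) ** -2 * num_actions * 2 ** num_actions


def sequentialized_bound(eps: float, gamma: float, num_actions: int) -> float:
    """Action-sequentialized expression: eps^-1 * (1 - gamma)^-2 * log^3 A.

    The base-2 logarithm is an assumption made for this illustration.
    """
    return eps ** -1 * (1 - gamma) ** -2 * math.log2(num_actions) ** 3


if __name__ == "__main__":
    eps, gamma = 0.1, 0.9  # illustrative values, not taken from the paper
    for num_actions in (2, 4, 8, 16):
        print(
            f"A={num_actions:2d}  "
            f"ESA={esa_bound(eps, gamma, num_actions):.3e}  "
            f"non-MDP={non_mdp_bound(eps, gamma, num_actions):.3e}  "
            f"sequentialized={sequentialized_bound(eps, gamma, num_actions):.3e}"
        )
```

In the ESA expression both $\varepsilon$ and $(1-\gamma)$ are raised to a power that grows with $A$, whereas in the two newer expressions the dependence on $\varepsilon$ and $\gamma$ is fixed and only the last factor depends on $A$.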

Related papers