Fugu-MT paper translation (abstract): Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms

Paper overview: Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms

arxiv url: http://arxiv.org/abs/2209.14990v1
Date: Thu, 29 Sep 2022 17:51:51 GMT
Status: Translation complete
Last updated in system: 2022-09-30 15:59:27.384850
Title: Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms
Title (reference translation): Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms
Authors: Fan Chen, Yu Bai, Song Mei
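Abstract summary: This paper addresses three aspects of partially observable RL in the general setting of Predictive State Representations (PSRs). It proposes a natural and unified structural condition for PSRs called B-stability. It shows that any B-stable PSR can be learned with polynomial samples in the relevant problem parameters, and that the resulting sample complexities improve on the current best ones when instantiated in known tractable subclasses.
Reference score (independently computed attention score): 25.658930892561735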
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Partial Observability -- where agents can only observe partial information about the true underlying state of the system -- is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for Partially Observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called B-stability. B-stable PSRs encompass the vast majority of known tractable subclasses such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomial samples in relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.
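Abstract (reference translation): Partial observability, where agents can observe only partial information about the true underlying state of the system, is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case because of an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its early stages: (1) unified structural conditions that enable sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects for partially observable RL in the general setting of Predictive State Representations (PSRs). First, it proposes a natural and unified structural condition for PSRs called B-stability. B-stable PSRs encompass the vast majority of known tractable subclasses, such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, it shows that any B-stable PSR can be learned with polynomial samples in the relevant problem parameters. When instantiated in the aforementioned subclasses, the resulting sample complexities improve substantially over the current best ones. Finally, the results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.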

Related papers