Fugu-MT Paper Translation (Abstract): Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

Paper overview: Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

arxiv url: http://arxiv.org/abs/2205.13476v2
Date: Mon, 1 Apr 2024 01:53:31 GMT
Status: Translation complete
System update date: 2024-04-04 14:31:02.352415
Title: Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency
Title (reference translation): Embedding to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency
Authors: Lingxiao Wang, Qi Cai, Zhuoran Yang, Zhaoran Wang
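Abstract summary: Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. Predicting the future often requires the full history, which induces a sample complexity that scales exponentially with the horizon. This paper proposes a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy.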
Reference score (independently calculated attention score): 105.17746223041954
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To our best knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
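Abstract (reference translation): Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) Predicting the future often requires the full history, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing these challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, the paper proposes a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) At each step, ETC learns to represent the state with a low-dimensional feature that factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding that assembles the per-step features. The two levels (i) and (ii) are integrated in a unified framework that admits a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of the authors' knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.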

List of related papers