Fugu-MT Paper Translation (Summary): Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Paper summary: Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

arxiv url: http://arxiv.org/abs/2012.14755v1
Date: Tue, 29 Dec 2020 14:06:09 GMT
Status: Translation complete
Last updated in system: 2021-04-18 20:46:41.347987
Title: Improved Sample Complexity for Incremental Autonomous Exploration in MDPs
Title (reference translation): Improving Sample Complexity for Incremental Autonomous Exploration in MDPs
Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
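Abstract summary: We learn a set of $\epsilon$-optimal goal-conditioned policies that attain all states incrementally reachable within $L$ steps. DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for cost-sensitive shortest-path problems.
Reference score (in-house attention score): 132.88757893161699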
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we introduce a novel model-based approach that interleaves discovering new states from $s_0$ and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as $\tilde{O}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$, where $A$ is the number of actions, $S_{L+\epsilon}$ is the number of states that are incrementally reachable from $s_0$ in $L+\epsilon$ steps, and $\Gamma_{L+\epsilon}$ is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both $\epsilon$ and $L$ at the cost of an extra $\Gamma_{L+\epsilon}$ factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem defined on the $L$-reachable states with minimum cost $c_{\min}$. Finally, we report preliminary empirical results confirming our theoretical findings.
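Abstract (reference translation): We study the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning a set of $\epsilon$-optimal goal-conditioned policies that attain all states incrementally reachable within $L$ steps from a reference state $s_0$. In this paper, we propose a novel model-based approach that interleaves discovering new states from $s_0$ with improving the accuracy of a model estimate used to compute goal-conditioned policies that reach the newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as $\tilde{O}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$, where $A$ is the number of actions, $S_{L+\epsilon}$ is the number of states incrementally reachable from $s_0$ in $L+\epsilon$ steps, and $\Gamma_{L+\epsilon}$ is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both $\epsilon$ and $L$, at the cost of an extra $\Gamma_{L+\epsilon}$ factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem defined on the $L$-reachable states with minimum cost $c_{\min}$. Finally, we report preliminary experimental results that support our theoretical findings.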
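To make the scaling of the bound concrete, the snippet below evaluates only its leading term $L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2}$. This is a minimal sketch, not code from the paper: the function name `disco_bound_leading_term` and all parameter values are hypothetical, and the log factors hidden by $\tilde{O}$ as well as all constants are dropped, so only ratios between evaluations are meaningful.

```python
def disco_bound_leading_term(L, S, Gamma, A, eps):
    """Leading term L^5 * S * Gamma * A / eps^2 of the DisCo sample-complexity bound.

    Hypothetical helper for illustration: log factors hidden by the tilde-O and
    all constants are ignored, so the value is not an absolute sample count.
    """
    return L**5 * S * Gamma * A / eps**2


# Hypothetical problem sizes, chosen only for illustration.
base = disco_bound_leading_term(L=10, S=100, Gamma=2, A=4, eps=0.1)

# Halving epsilon multiplies the term by 2^2 = 4.
ratio_eps = disco_bound_leading_term(10, 100, 2, 4, 0.05) / base

# Doubling L multiplies the term by 2^5 = 32 when S and Gamma are held fixed;
# in a real MDP, S_{L+eps} and Gamma_{L+eps} may also grow with L.
ratio_L = disco_bound_leading_term(20, 100, 2, 4, 0.1) / base

print(f"halving eps: x{ratio_eps:.0f}, doubling L: x{ratio_L:.0f}")
```

Holding $S_{L+\epsilon}$ and $\Gamma_{L+\epsilon}$ fixed is a simplification made here for readability; the abstract defines both quantities as depending on $L$ and $\epsilon$.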


Related papers