Fugu-MT Paper Translation (Abstract): A Provably Efficient Sample Collection Strategy for Reinforcement Learning

Paper overview: A Provably Efficient Sample Collection Strategy for Reinforcement Learning

arxiv url: http://arxiv.org/abs/2007.06437v2
Date: Thu, 18 Nov 2021 15:36:55 GMT
Status: Translation complete
Last updated in system: 2022-11-10 22:55:23.984204
Title: A Provably Efficient Sample Collection Strategy for Reinforcement Learning
Title (reference translation): An Effective Sample Collection Strategy for Reinforcement Learning
Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
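Abstract summary: One of the challenges in online reinforcement learning (RL) is that the agent must trade off exploring the environment against exploiting its samples in order to optimize its behavior. The paper proposes a decoupled approach: 1) an "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it had access to a generative model (i.e., a simulator of the environment); 2) an "objective-agnostic" sample collection strategy that generates the prescribed samples as fast as possible.
Reference score (independently computed attention measure): 123.69175280309226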
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. Whether we optimize for regret, sample complexity, state-space coverage or model estimation, we need to strike a different exploration-exploitation trade-off. In this paper, we propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that (adaptively) prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., a simulator of the environment); 2) An "objective-agnostic" sample collection exploration strategy responsible for generating the prescribed samples as fast as possible. Building on recent methods for exploration in the stochastic shortest path problem, we first provide an algorithm that, given as input the number of samples $b(s,a)$ needed in each state-action pair, requires $\tilde{O}(B D + D^{3/2} S^2 A)$ time steps to collect the $B=\sum_{s,a} b(s,a)$ desired samples, in any unknown communicating MDP with $S$ states, $A$ actions and diameter $D$. Then we show how this general-purpose exploration algorithm can be paired with "objective-specific" strategies that prescribe the sample requirements to tackle a variety of settings -- e.g., model estimation, sparse reward discovery, goal-free cost-free exploration in communicating MDPs -- for which we obtain improved or novel sample complexity guarantees.
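Abstract (reference translation): One of the challenges in online reinforcement learning (RL) is that the agent must trade off exploring the environment against exploiting the samples it has gathered in order to optimize its behavior. Whether the target is regret, sample complexity, state-space coverage, or model estimation, a different exploration-exploitation trade-off has to be struck. This paper proposes tackling the exploration-exploitation problem with a decoupled approach made up of two parts: 1) an "objective-specific" algorithm that (adaptively) prescribes how many samples to collect at which states, as if it had access to a generative model (i.e., a simulator of the environment); 2) an "objective-agnostic" sample collection exploration strategy responsible for generating the prescribed samples as fast as possible. Building on recent methods for exploration in the stochastic shortest path problem, the paper first gives an algorithm that, taking as input the number of samples $b(s,a)$ needed in each state-action pair, collects the $B=\sum_{s,a} b(s,a)$ desired samples in $\tilde{O}(B D + D^{3/2} S^2 A)$ time steps in any unknown communicating MDP with $S$ states, $A$ actions and diameter $D$. It then shows how this general-purpose exploration algorithm can be paired with "objective-specific" strategies that prescribe the sample requirements for a variety of settings (e.g., model estimation, sparse reward discovery, goal-free cost-free exploration in communicating MDPs), yielding improved or novel sample complexity guarantees.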
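
To get a feel for the bound $\tilde{O}(B D + D^{3/2} S^2 A)$ stated in the abstract, the two terms can be evaluated for a concrete sample requirement. The following is only a rough sketch: it drops all constants and logarithmic factors hidden by $\tilde{O}$, so the numbers indicate scaling rather than actual step counts. The function name `collection_time_terms`, the variable names, and the example values are hypothetical and do not come from the paper.

```python
from typing import Dict, Tuple


def collection_time_terms(
    b: Dict[Tuple[int, int], int], S: int, A: int, D: float
) -> Tuple[int, float, float]:
    """Return (B, B*D, D^{3/2} * S^2 * A) for a sample requirement b(s, a).

    Assumption: constants and log factors hidden by the O-tilde
    notation are ignored, so these are scaling terms only.
    """
    B = sum(b.values())                   # B = sum over (s, a) of b(s, a)
    linear_term = B * D                   # grows with the number of requested samples
    additive_term = D ** 1.5 * S ** 2 * A # independent of B
    return B, linear_term, additive_term


# Hypothetical example: 10 states, 2 actions, diameter 4,
# and 100 samples requested for every state-action pair.
S, A, D = 10, 2, 4.0
b = {(s, a): 100 for s in range(S) for a in range(A)}

B, linear_term, additive_term = collection_time_terms(b, S, A, D)
print(B, linear_term, additive_term)  # 2000 8000.0 1600.0
```

In this made-up instance the $B D$ term is the larger of the two, so under these assumptions the cost scales roughly as $D$ time steps per requested sample; for small $B$ the $B$-independent term $D^{3/2} S^2 A$ would be the larger one instead.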

Related papers list