Fugu-MT Paper Translation (Abstract): Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes

Paper summary: Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes

arxiv url: http://arxiv.org/abs/2201.11206v1
Date: Wed, 26 Jan 2022 22:09:59 GMT
Status: Translation complete
Last updated in system: 2022-01-28 14:05:59.243059
Title: Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes
Title (reference translation): In Linear Markov Decision Processes, Reward-Free RL Is No Harder Than Reward-Aware RL
Authors: Andrew Wagenmaker, Yifang Chen, Max Simchowitz, Simon S. Du, Kevin Jamieson
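Abstract summary: Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration. We show that the separation between reward-free RL and reward-aware (PAC) RL known from the tabular setting does not exist in the setting of linear MDPs. We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.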
Reference score (independently computed attention score): 61.11090361892306
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the tabular setting, it is well known that this is a more difficult problem than PAC RL -- where the agent has access to the reward function during exploration -- with optimal sample complexities in the two settings differing by a factor of $|\mathcal{S}|$, the size of the state space. We show that this separation does not exist in the setting of linear MDPs. We first develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP with sample complexity scaling as $\mathcal{O}(d^2/\epsilon^2)$. We then show a matching lower bound of $\Omega(d^2/\epsilon^2)$ on PAC RL. To our knowledge, our approach is the first computationally efficient algorithm to achieve optimal $d$ dependence in linear MDPs, even in the single-reward PAC setting. Our algorithm relies on a novel procedure which efficiently traverses a linear MDP, collecting samples in any given "feature direction", and enjoys a sample complexity scaling optimally in the (linear MDP equivalent of the) maximal state visitation probability. We show that this exploration procedure can also be applied to solve the problem of obtaining "well-conditioned" covariates in linear MDPs.
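Abstract (reference translation): Reward-free reinforcement learning (RL) considers a setting in which the agent cannot access a reward function during exploration, yet must propose a near-optimal policy for an arbitrary reward function that is revealed only after exploration. In the tabular setting, this is known to be a harder problem than PAC RL, in which the agent can access the reward function during exploration; the optimal sample complexities of the two settings differ by a factor of $|\mathcal{S}|$, the size of the state space. We show that this separation does not exist in the setting of linear MDPs. First, we develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP whose sample complexity scales as $\mathcal{O}(d^2/\epsilon^2)$. Next, we show a matching lower bound of $\Omega(d^2/\epsilon^2)$ for PAC RL. To our knowledge, this approach is the first computationally efficient algorithm to achieve the optimal $d$ dependence in linear MDPs, even in the single-reward PAC setting. The algorithm relies on a novel procedure that efficiently traverses a linear MDP, collects samples in any given "feature direction", and enjoys a sample complexity that scales optimally in the (linear MDP equivalent of the) maximal state visitation probability. We show that this exploration procedure can also be applied to obtain "well-conditioned" covariates in linear MDPs.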
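
The abstract mentions obtaining "well-conditioned" covariates without spelling out what that means. One common reading, assumed here and not taken from the paper, is that the smallest eigenvalue of the empirical feature covariance $\sum_i \phi_i \phi_i^\top$ is bounded away from zero, so every feature direction has been sampled. The sketch below only illustrates that check on made-up feature vectors; the function name `min_covariate_eigenvalue` and the data are hypothetical, and this is not the paper's exploration procedure.

```python
import numpy as np


def min_covariate_eigenvalue(features: np.ndarray) -> float:
    """Return the smallest eigenvalue of sum_i phi_i phi_i^T.

    `features` has shape (n_samples, d); each row is one feature vector.
    A value near zero means some feature direction was barely sampled.
    """
    covariance = features.T @ features  # (d, d) symmetric PSD matrix
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(covariance)[0])


# Hypothetical data in d = 2: all samples lie along the first direction.
poorly_spread = np.array([[1.0, 0.0]] * 4)

# Hypothetical data in d = 2: samples cover both directions equally.
well_spread = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

poor_value = min_covariate_eigenvalue(poorly_spread)
good_value = min_covariate_eigenvalue(well_spread)
```

Under this reading, `poorly_spread` is ill-conditioned because its second direction is never observed, while `well_spread` has equal coverage in both directions.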
