Fugu-MT Paper Translation (Abstract): Representation Learning for Online and Offline RL in Low-rank MDPs

Paper overview: Representation Learning for Online and Offline RL in Low-rank MDPs

arxiv url: http://arxiv.org/abs/2110.04652v1
Date: Sat, 9 Oct 2021 22:04:34 GMT
Status: Translation complete
Last updated in system: 2021-10-12 14:33:19.927975
Title: Representation Learning for Online and Offline RL in Low-rank MDPs
Title (reference translation): Representation learning for online and offline RL in low-rank MDPs
Authors: Masatoshi Uehara, Xuezhou Zhang, Wen Sun
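Abstract summary: We focus on low-rank Markov decision processes (MDPs), in which the transition dynamics correspond to a low-rank transition matrix. For the online setting, operating with the same oracles used in FLAMBE, we propose REP-UCB (Upper Confidence Bound driven Representation learning for RL). For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition.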
Reference score (independently computed attention score): 36.398511188102205
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that, on top of the representation, we can perform RL procedures such as exploration and exploitation in a sample-efficient manner? We focus on low-rank Markov Decision Processes (MDPs), where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al.), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm, REP-UCB (Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for FLAMBE to $\widetilde{O}( A^4 d^4 / (\epsilon^2 (1-\gamma)^{3}) )$, with $d$ being the rank of the transition matrix (or dimension of the ground truth representation), $A$ being the number of actions, and $\gamma$ being the discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.
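To make the low-rank assumption concrete, the sketch below builds a small tabular transition model that factorizes as $P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle$ and checks that the resulting matrix has rank at most $d$. This is only an illustration, not the paper's algorithm: the function name `make_low_rank_transition`, the problem sizes, and the choice of simplex-valued features (a latent-variable special case that keeps every row a valid distribution) are all assumptions made for this example.

```python
import numpy as np


def make_low_rank_transition(num_states, num_actions, rank, seed=0):
    """Build P[s, a, s'] = <phi(s, a), mu(s')> from nonnegative factors.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # phi[s, a, :] is a probability vector over `rank` latent components.
    phi = rng.dirichlet(np.ones(rank), size=(num_states, num_actions))
    # mu[k, :] is a distribution over next states for latent component k.
    mu = rng.dirichlet(np.ones(num_states), size=rank)
    # A simplex-weighted mixture of distributions is again a distribution.
    transition = phi @ mu  # shape: (num_states, num_actions, num_states)
    return phi, mu, transition


phi, mu, P = make_low_rank_transition(num_states=20, num_actions=3, rank=4)

# Flatten (s, a) pairs into rows to view P as an (S*A) x S matrix.
flat = P.reshape(-1, P.shape[-1])
print("matrix shape:", flat.shape)
print("rows sum to one:", np.allclose(flat.sum(axis=1), 1.0))
print("numerical rank:", np.linalg.matrix_rank(flat))
```

In this toy setting both factors are known by construction. The setting studied in the abstract is harder precisely because the representation $\phi$ is not given and has to be learned from data.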
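Abstract (reference translation): This work considers the problem of representation learning in RL: how can we learn a compact low-dimensional representation on top of which RL procedures such as exploration and exploitation can be carried out in a sample-efficient manner? We focus on low-rank Markov decision processes (MDPs), in which the transition dynamics correspond to a low-rank transition matrix. Unlike prior work that assumes the representation is known (e.g., linear MDPs), here the representation of the low-rank MDP must be learned. We study both online RL and offline RL. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al.), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm, REP-UCB (Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for FLAMBE to $\widetilde{O}( A^4 d^4 / (\epsilon^2 (1-\gamma)^{3}) )$, with $d$ being the rank of the transition matrix (or dimension of the ground truth representation), $A$ being the number of actions, and $\gamma$ being the discount factor. Notably, REP-UCB is simpler than FLAMBE: it directly balances the interplay between representation learning, exploration, and exploitation, whereas FLAMBE is an explore-then-commit style approach that must perform reward-free exploration step by step. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition.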
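As a rough sense of scale for the two sample-complexity bounds quoted above, the sketch below evaluates their leading terms and their ratio, which simplifies to $A^5 d^3 / (\epsilon^{8} (1-\gamma)^{19})$. The function names and the example parameter values are assumptions made for illustration. The $\widetilde{O}(\cdot)$ notation hides constants and logarithmic factors, so these numbers indicate only how the bounds scale, not how many samples either algorithm actually needs.

```python
def flambe_leading_term(num_actions, rank, epsilon, gamma):
    """Leading term A^9 d^7 / (eps^10 (1 - gamma)^22) of the FLAMBE bound."""
    return num_actions**9 * rank**7 / (epsilon**10 * (1 - gamma) ** 22)


def rep_ucb_leading_term(num_actions, rank, epsilon, gamma):
    """Leading term A^4 d^4 / (eps^2 (1 - gamma)^3) of the REP-UCB bound."""
    return num_actions**4 * rank**4 / (epsilon**2 * (1 - gamma) ** 3)


def improvement_ratio(num_actions, rank, epsilon, gamma):
    """Ratio of the two leading terms: A^5 d^3 / (eps^8 (1 - gamma)^19)."""
    return flambe_leading_term(
        num_actions, rank, epsilon, gamma
    ) / rep_ucb_leading_term(num_actions, rank, epsilon, gamma)


# Hypothetical example values, chosen only to show the scaling.
params = dict(num_actions=5, rank=10, epsilon=0.1, gamma=0.9)
print(f"FLAMBE leading term:  {flambe_leading_term(**params):.3e}")
print(f"REP-UCB leading term: {rep_ucb_leading_term(**params):.3e}")
print(f"ratio:                {improvement_ratio(**params):.3e}")
```

The ratio grows fastest in the effective horizon $1/(1-\gamma)$ and in the accuracy $1/\epsilon$, which is where the quoted improvement is most pronounced.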

Related papers