Representation Learning for Online and Offline RL in Low-rank MDPs
- URL: http://arxiv.org/abs/2110.04652v1
- Date: Sat, 9 Oct 2021 22:04:34 GMT
- Title: Representation Learning for Online and Offline RL in Low-rank MDPs
- Authors: Masatoshi Uehara, Xuezhou Zhang, Wen Sun
- Abstract summary: We focus on the low-rank Markov Decision Processes (MDPs) where the transition dynamics correspond to a low-rank transition matrix.
For the online setting, operating with the same computational oracles used in FLAMBE, we propose REP-UCB (Upper Confidence Bound driven Representation learning for RL).
For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work studies the question of Representation Learning in RL: how can we
learn a compact low-dimensional representation such that on top of the
representation we can perform RL procedures such as exploration and
exploitation, in a sample efficient manner. We focus on the low-rank Markov
Decision Processes (MDPs) where the transition dynamics correspond to a
low-rank transition matrix. Unlike prior works that assume the representation
is known (e.g., linear MDPs), here we need to learn the representation for the
low-rank MDP. We study both the online RL and offline RL settings. For the
online setting, operating with the same computational oracles used in FLAMBE
(Agarwal et al.), the state-of-the-art algorithm for learning representations in
low-rank MDPs, we propose an algorithm, REP-UCB (Upper Confidence Bound driven
Representation learning for RL), which significantly improves the sample
complexity from $\widetilde{O}( A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for
FLAMBE to $\widetilde{O}( A^4 d^4 / (\epsilon^2 (1-\gamma)^{3}) )$ with $d$
being the rank of the transition matrix (or dimension of the ground truth
representation), $A$ being the number of actions, and $\gamma$ being the
discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly
balances the interplay between representation learning, exploration, and
exploitation, while FLAMBE is an explore-then-commit style approach and has to
perform reward-free exploration step-by-step forward in time. For the offline
RL setting, we develop an algorithm that leverages pessimism to learn under a
partial coverage condition: our algorithm is able to compete against any policy
as long as it is covered by the offline distribution.
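For readers unfamiliar with the setting, the low-rank structure and the UCB-style bonus referenced in the abstract can be sketched as follows. This is the standard low-rank MDP factorization used in FLAMBE and related work, together with the generic elliptical-bonus form that UCB-driven algorithms of this family use; the notation ($\hat{\phi}$, $\lambda$, $\beta$) and any constants are illustrative, not quoted from the paper.

```latex
% Low-rank MDP: the transition kernel factorizes through an unknown
% d-dimensional representation phi* and a measure-valued map mu*.
\[
P(s' \mid s, a) \;=\; \big\langle \phi^{\star}(s, a),\, \mu^{\star}(s') \big\rangle,
\qquad \phi^{\star}\colon \mathcal{S}\times\mathcal{A} \to \mathbb{R}^{d},
\quad \mu^{\star}\colon \mathcal{S} \to \mathbb{R}^{d}.
\]

% Generic elliptical exploration bonus built from a learned representation
% \hat{\phi} and the empirical feature covariance of the data so far;
% optimistic (online) algorithms add such a bonus to the reward.
\[
\widehat{\Sigma}_{n} \;=\; \lambda I + \sum_{i=1}^{n} \hat{\phi}(s_i, a_i)\, \hat{\phi}(s_i, a_i)^{\top},
\qquad
b_{n}(s, a) \;=\; \beta\, \big\| \hat{\phi}(s, a) \big\|_{\widehat{\Sigma}_{n}^{-1}}.
\]
```

Roughly speaking, the offline algorithm leverages pessimism instead of optimism, with a penalty playing the role of the bonus above, which is why the learned policy only needs to compete against policies covered by the offline distribution.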
Related papers
- Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove a sample complexity bound for achieving an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z)
- Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
arXiv Detail & Related papers (2022-07-29T17:29:08Z)
- Contrastive Learning as Goal-Conditioned Reinforcement Learning [147.28638631734486]
In reinforcement learning (RL), it is easier to solve a task if given a good representation.
While deep RL should automatically acquire such good representations, prior work often finds that learning representations in an end-to-end fashion is unstable.
We show (contrastive) representation learning methods can be cast as RL algorithms in their own right.
arXiv Detail & Related papers (2022-06-15T14:34:15Z)
- Provable Benefit of Multitask Representation Learning in Reinforcement Learning [46.11628795660159]
This paper theoretically characterizes the benefit of representation learning under the low-rank Markov decision process (MDP) model.
To the best of our knowledge, this is the first theoretical study that characterizes the benefit of representation learning in exploration-based reward-free multitask reinforcement learning.
arXiv Detail & Related papers (2022-06-13T04:29:02Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- Provably Efficient Representation Selection in Low-rank Markov Decision Processes: From Online to Offline RL [84.14947307790361]
We propose an efficient algorithm, called ReLEX, for representation learning in both online and offline reinforcement learning.
We show that the online version of ReLEX, called ReLEX-UCB, always performs no worse than the state-of-the-art algorithm without representation selection.
For the offline counterpart, ReLEX-LCB, we show that the algorithm can find the optimal policy if the representation class can cover the state-action space.
arXiv Detail & Related papers (2021-06-22T17:16:50Z)
- FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs [53.710405006523274]
This work focuses on the representation learning question: how can we learn such features?
Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem.
We develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models.
arXiv Detail & Related papers (2020-06-18T19:11:18Z)
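As context for the FLAMBE entry above, the non-linear matrix decomposition it refers to is typically accessed through a maximum likelihood oracle over a candidate class of factorizations, which is also the computational oracle assumed by the REP-UCB abstract above. The sketch below writes that oracle in generic notation; the class symbols $\Phi$, $\Upsilon$ and the dataset symbol $\mathcal{D}$ are illustrative placeholders, not quoted from either paper.

```latex
% Maximum likelihood oracle over candidate low-rank factorizations (phi, mu):
% given observed transition triples (s, a, s'), pick the pair whose induced
% next-state distribution best explains the data.
\[
(\hat{\phi}, \hat{\mu}) \;\in\;
\arg\max_{\phi \in \Phi,\; \mu \in \Upsilon}\;
\sum_{(s, a, s') \in \mathcal{D}} \log \big\langle \phi(s, a),\, \mu(s') \big\rangle.
\]
```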