Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes
- URL: http://arxiv.org/abs/2201.11206v1
- Date: Wed, 26 Jan 2022 22:09:59 GMT
- Title: Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes
- Authors: Andrew Wagenmaker, Yifang Chen, Max Simchowitz, Simon S. Du, Kevin
Jamieson
- Abstract summary: Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
- Score: 61.11090361892306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward-free reinforcement learning (RL) considers the setting where the agent
does not have access to a reward function during exploration, but must propose
a near-optimal policy for an arbitrary reward function revealed only after
exploring. In the tabular setting, it is well known that this is a more
difficult problem than PAC RL -- where the agent has access to the reward
function during exploration -- with optimal sample complexities in the two
settings differing by a factor of $|\mathcal{S}|$, the size of the state space.
We show that this separation does not exist in the setting of linear MDPs. We
first develop a computationally efficient algorithm for reward-free RL in a
$d$-dimensional linear MDP with sample complexity scaling as
$\mathcal{O}(d^2/\epsilon^2)$. We then show a matching lower bound of
$\Omega(d^2/\epsilon^2)$ on PAC RL. To our knowledge, our approach is the first
computationally efficient algorithm to achieve optimal $d$ dependence in linear
MDPs, even in the single-reward PAC setting. Our algorithm relies on a novel
procedure which efficiently traverses a linear MDP, collecting samples in any
given "feature direction", and enjoys a sample complexity scaling optimally in
the (linear MDP equivalent of the) maximal state visitation probability. We
show that this exploration procedure can also be applied to solve the problem
of obtaining "well-conditioned" covariates in linear MDPs.
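To make the setting concrete, here is a minimal Python sketch of the two objects the abstract refers to: a feature design matrix built from reward-free exploration data, whose minimum eigenvalue measures how "well-conditioned" the collected covariates are, and a single ridge regression that estimates a linear value parameter once a reward is revealed. This is an illustration of the setting, not the paper's algorithm; the dimensions, random data, and function names are assumptions made for the example.

```python
# Illustrative sketch of reward-free data collection in a linear MDP
# (not the paper's algorithm): features are gathered without any reward,
# and an arbitrary reward revealed afterwards is handled by least squares.
import numpy as np

d = 8                                     # feature dimension (illustrative)
rng = np.random.default_rng(0)

# Hypothetical exploration log: feature vectors phi(s_t, a_t) collected
# without observing any reward.
Phi = rng.normal(size=(5000, d))

# "Well-conditioned" covariates: the design matrix sum_t phi_t phi_t^T should
# have a large minimum eigenvalue, i.e. every feature direction was visited.
Lambda = Phi.T @ Phi + 1e-3 * np.eye(d)   # small ridge term for invertibility
print("min eigenvalue of design matrix:", np.linalg.eigvalsh(Lambda).min())

# After exploration, a reward is revealed on the logged data (stand-in here),
# and one ridge regression recovers its linear parameter; a step of
# least-squares value iteration would regress reward plus next-state value
# in exactly the same way.
rewards = Phi @ rng.normal(size=d)
theta_hat = np.linalg.solve(Lambda, Phi.T @ rewards)

def q_estimate(phi_sa: np.ndarray) -> float:
    """Estimated value of a state-action pair from its feature vector."""
    return float(phi_sa @ theta_hat)
```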
Related papers
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
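As a rough illustration of the idea in this entry, the sketch below draws an approximate posterior sample of linear Q-function parameters by running noisy (Langevin) gradient steps on a least-squares fitting loss and then acting greedily with the sample. It is a generic sketch under simplifying assumptions (linear Q, synthetic data, placeholder step size and temperature), not the authors' implementation.

```python
# Generic sketch of Thompson sampling via Langevin Monte Carlo: noisy
# gradient steps on a regularized least-squares Q-fitting loss yield an
# approximate posterior sample of the Q parameters.
import numpy as np

rng = np.random.default_rng(1)
d = 16                                    # illustrative feature dimension
Phi = rng.normal(size=(256, d))           # features of logged transitions
targets = Phi @ rng.normal(size=d)        # stand-in for Bellman targets

def langevin_sample(theta, n_steps=100, step=1e-3, temperature=1.0, ridge=1.0):
    """Draw an approximate posterior sample of linear Q parameters."""
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ theta - targets) + ridge * theta  # loss gradient
        noise = rng.normal(size=d)
        theta = theta - step * grad + np.sqrt(2 * step * temperature) * noise
    return theta

theta_sample = langevin_sample(np.zeros(d))
# Acting greedily with respect to phi(s, a) @ theta_sample implements the
# "sample the Q function, then exploit it" exploration strategy sketched above.
```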
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs [43.53286390357673]
This paper focuses on reward-free reinforcement learning under low-rank MDP models.
We first provide the first known sample complexity lower bound for any algorithm under low-rank MDPs.
We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification.
arXiv Detail & Related papers (2023-03-20T04:39:39Z)
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
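For intuition, the sketch below shows a generic feature-space uncertainty bonus of the kind used as an exploration-driven pseudo-reward during reward-free exploration. The exact weighting, constants, and feature map in the paper's algorithm differ, so every name and constant here is an assumption for illustration only.

```python
# Generic sketch of an exploration-driven pseudo-reward (not the paper's exact
# construction): the bonus is large for feature directions the running design
# matrix has rarely seen, steering exploration toward under-visited directions.
import numpy as np

def pseudo_reward(Lambda: np.ndarray, phi: np.ndarray, beta: float = 1.0) -> float:
    """Uncertainty bonus ~ beta * ||phi||_{Lambda^{-1}}, clipped to [0, 1]."""
    width = np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return float(min(1.0, beta * width))

def update_design(Lambda: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Rank-one update of the design matrix after observing feature phi."""
    return Lambda + np.outer(phi, phi)

# Illustrative usage with a random feature vector in R^8.
d = 8
rng = np.random.default_rng(3)
Lambda = np.eye(d)                       # regularized design matrix
phi = rng.normal(size=d)
bonus = pseudo_reward(Lambda, phi)       # large while phi's direction is unseen
Lambda = update_design(Lambda, phi)      # bonus for similar phi shrinks afterwards
```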
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
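The reward-redistribution step can be pictured with the small sketch below: only trajectory-level returns are observed, so a linear per-step proxy reward is fit by least squares on the summed step features. This illustrates the general idea under simplifying assumptions (linear rewards, synthetic data, made-up dimensions); it is not PARTED itself.

```python
# Minimal sketch of least-squares reward redistribution: regress each
# trajectory's total return onto the sum of its per-step features, then read
# off per-step proxy rewards from the fitted linear model.
import numpy as np

rng = np.random.default_rng(2)
d, H, n_traj = 6, 10, 200                # feature dim, horizon, #trajectories

# Hypothetical logged data: per-step features and trajectory-level returns.
step_features = rng.normal(size=(n_traj, H, d))
traj_returns = step_features.sum(axis=1) @ rng.normal(size=d)  # stand-in data

# Ridge-regularized regression of returns on summed step features.
X = step_features.sum(axis=1)            # shape (n_traj, d)
theta = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ traj_returns)

# Per-step proxy rewards for one trajectory, which a pessimistic
# value-iteration style offline RL method could then consume.
proxy_rewards = step_features[0] @ theta  # shape (H,)
```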
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest.
In this work, we give both positive and negative results for reward-free RL with linear function approximation.
arXiv Detail & Related papers (2020-06-19T17:59:36Z)