Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs
- URL: http://arxiv.org/abs/2303.10859v1
- Date: Mon, 20 Mar 2023 04:39:39 GMT
- Title: Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs
- Authors: Yuan Cheng, Ruiquan Huang, Jing Yang, Yingbin Liang
- Abstract summary: This paper focuses on reward-free reinforcement learning under low-rank MDP models.
We first provide the first known sample complexity lower bound for any algorithm under low-rank MDPs.
We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification.
- Score: 43.53286390357673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reward-free reinforcement learning (RL), an agent explores the environment
first without any reward information, in order to achieve certain learning
goals afterwards for any given reward. In this paper we focus on reward-free RL
under low-rank MDP models, in which both the representation and linear weight
vectors are unknown. Although various algorithms have been proposed for
reward-free low-rank MDPs, the corresponding sample complexity is still far
from being satisfactory. In this work, we first provide the first known sample
complexity lower bound that holds for any algorithm under low-rank MDPs. This
lower bound implies it is strictly harder to find a near-optimal policy under
low-rank MDPs than under linear MDPs. We then propose a novel model-based
algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal
policy and achieve an $\epsilon$-accurate system identification via reward-free
exploration, with a sample complexity significantly improving the previous
results. Such a sample complexity matches our lower bound in the dependence on
$\epsilon$, as well as on $K$ in the large $d$ regime, where $d$ and $K$
respectively denote the representation dimension and action space cardinality.
Finally, we provide a planning algorithm (without further interaction with the true
environment) for RAFFLE to learn a near-accurate representation, which is the
first known representation learning guarantee under the same setting.
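For concreteness, a standard formalization of this setting (the notation below is the
usual convention in this literature, not text from the paper itself): a low-rank MDP
posits that the transition kernel factorizes as $P(s'|s,a) = \langle \phi(s,a), \mu(s') \rangle$
for feature maps $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and
$\mu: \mathcal{S} \to \mathbb{R}^d$ that are both unknown to the learner; a linear MDP is
the special case in which $\phi$ is known. The reward-free objective is that, after
exploring with no reward signal, for any reward function $r$ revealed afterwards the
agent can return a policy $\hat{\pi}$ with $V^{*}_{r} - V^{\hat{\pi}}_{r} \le \epsilon$
(an $\epsilon$-optimal policy), together with an estimate of the transition model that is
accurate to within $\epsilon$ (an $\epsilon$-accurate system identification).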
Related papers
- Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback [38.61232011566285]
We study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually.
In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees.
arXiv Detail & Related papers (2024-05-13T10:51:01Z)
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs)
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy (the linear mixture MDP model is recalled in a short note after this list).
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that the known separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
- Adaptive Reward-Free Exploration [48.98199700043158]
We show that our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994.
We further investigate the relative complexities of reward-free exploration and best-policy identification.
arXiv Detail & Related papers (2020-06-11T09:58:03Z)
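A brief definitional note for the linear mixture MDP entry above (standard usage in
this literature, not quoted from the listed abstract): a linear mixture MDP assumes the
transition kernel is a mixture of $d$ known basis kernels,
$P(s'|s,a) = \sum_{i=1}^{d} \theta^{*}_{i} P_{i}(s'|s,a) = \langle \psi(s,a,s'), \theta^{*} \rangle$,
where the feature map $\psi$ (equivalently, the basis kernels $P_{i}$) is known and only
the mixing weights $\theta^{*} \in \mathbb{R}^{d}$ are unknown. This differs from the
linear and low-rank MDPs discussed above, whose factorization
$\langle \phi(s,a), \mu(s') \rangle$ separates $(s,a)$ from $s'$.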