Improved Sample Complexity for Reward-free Reinforcement Learning under
Low-rank MDPs
- URL: http://arxiv.org/abs/2303.10859v1
- Date: Mon, 20 Mar 2023 04:39:39 GMT
- Title: Improved Sample Complexity for Reward-free Reinforcement Learning under
Low-rank MDPs
- Authors: Yuan Cheng, Ruiquan Huang, Jing Yang, Yingbin Liang
- Abstract summary: This paper focuses on reward-free reinforcement learning under low-rank MDP models.
We first provide the first known sample complexity lower bound for any algorithm under low-rank MDPs.
We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification.
- Score: 43.53286390357673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reward-free reinforcement learning (RL), an agent first explores the environment
without any reward information, so that it can afterwards achieve specified
learning goals for any given reward function. In this paper, we focus on
reward-free RL under low-rank MDP models, in which both the representation and
the linear weight vectors are unknown. Although various algorithms have been
proposed for reward-free low-rank MDPs, the corresponding sample complexity
remains far from satisfactory. In this work, we first provide the first known
sample complexity lower bound that holds for any algorithm under low-rank MDPs.
This lower bound implies that it is strictly harder to find a near-optimal
policy under low-rank MDPs than under linear MDPs. We then propose a novel
model-based algorithm, coined RAFFLE, and show that it can both find an
$\epsilon$-optimal policy and achieve an $\epsilon$-accurate system
identification via reward-free exploration, with a sample complexity that
significantly improves upon previous results. This sample complexity matches
our lower bound in its dependence on $\epsilon$, as well as on $K$ in the
large-$d$ regime, where $d$ and $K$ denote the representation dimension and the
action space cardinality, respectively. Finally, we provide a planning
algorithm (requiring no further interaction with the true environment) for
RAFFLE to learn a near-accurate representation, which is the first known
representation learning guarantee under this setting.
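For context, the display below gives the standard low-rank MDP factorization that the abstract refers to. This is the formulation commonly used in the literature; the paper's own notation may differ in detail.

```latex
% Standard low-rank MDP definition (common in the literature; the paper's
% exact notation is an assumption here). The transition kernel factorizes as
\[
  P(s' \mid s, a) \;=\; \big\langle \phi^{*}(s, a),\, \mu^{*}(s') \big\rangle,
  \qquad \phi^{*}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{d},\quad
  \mu^{*}:\mathcal{S}\to\mathbb{R}^{d},
\]
% where both the representation $\phi^{*}$ and the measures $\mu^{*}$ are
% unknown to the learner. A linear MDP is the special case in which
% $\phi^{*}$ is known, which is why the abstract's lower bound can separate
% the two settings.
```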
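To make the two-phase protocol in the abstract concrete, here is a minimal tabular sketch of generic reward-free RL: explore once without observing rewards, then plan offline for any given reward function. This is illustrative only and is not RAFFLE (whose model-based exploration and representation learning are specific to the paper); the `ToyEnv` class and all names below are hypothetical.

```python
import numpy as np

class ToyEnv:
    """Hypothetical tabular MDP, included only to make the sketch runnable."""

    def __init__(self, n_states=5, n_actions=3, horizon=10, seed=0):
        self.n_states, self.n_actions, self.horizon = n_states, n_actions, horizon
        self.rng = np.random.default_rng(seed)
        p = self.rng.random((n_states, n_actions, n_states))
        self.p = p / p.sum(axis=2, keepdims=True)  # random transition kernel

    def reset(self):
        self.s, self.t = 0, 0
        return self.s

    def step(self, a):
        # Note: no reward is returned -- exploration is reward-free.
        self.s = int(self.rng.choice(self.n_states, p=self.p[self.s, a]))
        self.t += 1
        return self.s, self.t >= self.horizon

def reward_free_rl(env, explore_episodes, reward_fns, gamma=0.99, iters=200):
    """Two-phase reward-free protocol: explore once, plan for many rewards."""
    # Phase 1: reward-free exploration, collecting only (s, a, s') transitions.
    counts = np.zeros((env.n_states, env.n_actions, env.n_states))
    for _ in range(explore_episodes):
        s, done = env.reset(), False
        while not done:
            a = int(env.rng.integers(env.n_actions))  # uniform exploration
            s_next, done = env.step(a)
            counts[s, a, s_next] += 1
            s = s_next

    # Fit the empirical model P_hat(s' | s, a) from visit counts.
    p_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)

    # Phase 2: offline value iteration on P_hat for each given reward function.
    policies = []
    for r in reward_fns:  # each r has shape (n_states, n_actions)
        q = np.zeros((env.n_states, env.n_actions))
        for _ in range(iters):
            q = r + gamma * (p_hat @ q.max(axis=1))  # Bellman optimality backup
        policies.append(q.argmax(axis=1))
    return policies

env = ToyEnv()
rewards = [np.eye(env.n_states)[:, :env.n_actions]]  # an arbitrary test reward
print(reward_free_rl(env, explore_episodes=500, reward_fns=rewards))
```

The point of the sketch is the separation the paper exploits: the exploration data is collected once, with no reward signal, yet afterwards a near-optimal policy can be planned for any reward function without touching the environment again.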