Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs
- URL: http://arxiv.org/abs/2303.10165v2
- Date: Wed, 14 Feb 2024 17:44:08 GMT
- Title: Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs
- Authors: Junkai Zhang and Weitong Zhang and Quanquan Gu
- Abstract summary: We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde O(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
- Score: 60.40452803295326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reward-free reinforcement learning (RL) with linear function
approximation, where the agent works in two phases: (1) in the exploration
phase, the agent interacts with the environment but cannot access the reward;
and (2) in the planning phase, the agent is given a reward function and is
expected to find a near-optimal policy based on samples collected in the
exploration phase. The sample complexities of existing reward-free algorithms
have a polynomial dependence on the planning horizon, which makes them
intractable for long planning horizon RL problems. In this paper, we propose a
new reward-free algorithm for learning linear mixture Markov decision processes
(MDPs), where the transition probability can be parameterized as a linear
combination of known feature mappings. At the core of our algorithm is
uncertainty-weighted value-targeted regression with exploration-driven
pseudo-reward and a high-order moment estimator for the aleatoric and epistemic
uncertainties. When the total reward is bounded by $1$, we show that our
algorithm only needs to explore $\tilde O( d^2\varepsilon^{-2})$ episodes to
find an $\varepsilon$-optimal policy, where $d$ is the dimension of the feature
mapping. The sample complexity of our algorithm only has a polylogarithmic
dependence on the planning horizon and therefore is "horizon-free". In
addition, we provide an $\Omega(d^2\varepsilon^{-2})$ sample complexity lower
bound, which matches the sample complexity of our algorithm up to logarithmic
factors, suggesting that our algorithm is optimal.
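To make the two ingredients named in the abstract concrete, below is a minimal, illustrative sketch in Python of (i) the linear mixture parameterization, where $[\mathbb{P}V](s,a) = \langle \sum_{s'}\phi(s'|s,a)V(s'), \theta^*\rangle$ for a known feature map $\phi$, and (ii) one step of uncertainty-weighted value-targeted regression. This is not the authors' implementation: the finite next-state set, the function names, the single plug-in variance estimate (standing in for the paper's high-order moment estimator), and the $1/d$ weight floor are all simplifying assumptions made for illustration.
```python
# Minimal sketch (assumptions noted above), not the paper's algorithm.
import numpy as np


def value_targeted_features(phi, V, s, a):
    """phi[s_next, s, a] is the known d-dim feature of the linear mixture MDP,
    so E[V(s') | s, a] = <sum_{s'} phi(s'|s,a) V(s'), theta*>."""
    return np.einsum("nd,n->d", phi[:, s, a], V)


def weighted_vtr_update(Sigma, b, x, y, var_estimate, d):
    """One step of uncertainty-weighted value-targeted (ridge) regression.

    x: aggregated feature for the observed (s, a); y: V(s') at the next state.
    The weight shrinks samples with large aleatoric/epistemic uncertainty;
    the 1/d floor is an assumed clipping choice, not the paper's exact one.
    """
    epistemic = float(x @ np.linalg.solve(Sigma, x))   # elliptical bonus term
    weight = 1.0 / max(var_estimate + epistemic, 1.0 / d)
    Sigma = Sigma + weight * np.outer(x, x)            # weighted Gram matrix
    b = b + weight * y * x                             # weighted targets
    theta_hat = np.linalg.solve(Sigma, b)              # ridge-style estimate
    return Sigma, b, theta_hat
```
In the exploration phase this update would be driven by an exploration-oriented pseudo-reward (roughly, the uncertainty bonus itself) rather than a true reward; in the planning phase the resulting theta_hat is combined with the revealed reward function via value iteration.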
Related papers
- Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning
with Linear Function Approximation [16.871660060209674]
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting.
We propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2 H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest.
In this work, we give both positive and negative results for reward-free RL with linear function approximation.
arXiv Detail & Related papers (2020-06-19T17:59:36Z) - Adaptive Reward-Free Exploration [48.98199700043158]
We show that our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994.
We further investigate the relative complexities of reward-free exploration and best-policy identification.
arXiv Detail & Related papers (2020-06-11T09:58:03Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)