Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation
- URL: http://arxiv.org/abs/2110.06394v1
- Date: Tue, 12 Oct 2021 23:03:58 GMT
- Title: Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation
- Authors: Weitong Zhang and Dongruo Zhou and Quanquan Gu
- Abstract summary: We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde O(H^4d(H + d)\epsilon^{-2})$ episodes.
- Score: 92.99933928528797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study model-based reward-free reinforcement learning with linear
function approximation for episodic Markov decision processes (MDPs). In this
setting, the agent works in two phases. In the exploration phase, the agent
interacts with the environment and collects samples without observing any
reward. In the planning phase, the agent is given a specific reward function
and uses the samples collected in the exploration phase to learn a good policy.
We propose a new provably efficient algorithm, called UCRL-RFE, under the
Linear Mixture MDP assumption, where the transition probability kernel of the
MDP can be parameterized by a linear function over certain feature mappings
defined on the triplet of state, action, and next state. We show that to
obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE
needs to sample at most $\tilde O(H^5d^2\epsilon^{-2})$ episodes during the
exploration phase. Here, $H$ is the length of the episode and $d$ is the
dimension of the feature mapping. We also propose a variant of UCRL-RFE that
uses a Bernstein-type bonus and show that it needs to sample at most
$\tilde O(H^4d(H + d)\epsilon^{-2})$ episodes to achieve an $\epsilon$-optimal
policy. By constructing a special class of Linear Mixture MDPs, we also prove
that any reward-free algorithm needs to sample at least
$\tilde \Omega(H^2d\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal
policy. Our upper bound matches the lower bound in its dependence on
$\epsilon$, and in its dependence on $d$ when $H \ge d$.
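To make the two-phase setting concrete, below is a minimal, self-contained sketch of reward-free exploration followed by planning on a toy linear mixture MDP, where $P(s'\mid s,a) = \langle \phi(s,a,s'), \theta^*\rangle$. Everything in the sketch is an assumption made for illustration: the toy sizes, the synthetic feature map built from $d$ base kernels, the uniform exploration policy, and the plain ridge regression on one-hot next-state targets. It is not the authors' UCRL-RFE, which instead drives exploration with an optimistic, bonus-based strategy (including the Bernstein-type bonus of its sharper variant).

```python
# Illustrative sketch (NOT the paper's UCRL-RFE): reward-free exploration and
# planning on a toy linear mixture MDP, i.e. P(s'|s,a) = <phi(s,a,s'), theta*>.
# Uniform exploration and the one-hot ridge regression below are simplifications
# assumed here for brevity.
import numpy as np

rng = np.random.default_rng(0)
S, A, d, H = 5, 3, 4, 6              # states, actions, feature dim, horizon
K_EXPLORE, LAM = 3000, 1.0           # exploration episodes, ridge parameter

# Build a linear mixture MDP from d base kernels and mixture weights theta*.
base = rng.dirichlet(np.ones(S), size=(d, S, A))      # base[i, s, a, :] is a distribution
theta_star = rng.dirichlet(np.ones(d))                # unknown mixture weights
P_true = np.tensordot(theta_star, base, axes=(0, 0))  # P_true[s, a, s']
phi = np.moveaxis(base, 0, -1)                        # phi[s, a, s', :] in R^d

# ---- Exploration phase: collect transitions; no reward is observed. ----
Lambda = LAM * np.eye(d)             # Gram matrix of regression features
b = np.zeros(d)
for _ in range(K_EXPLORE):
    s = 0
    for _h in range(H):
        a = rng.integers(A)                           # uniform exploration (simplification)
        s_next = rng.choice(S, p=P_true[s, a])
        # One-hot regression: E[1{s'=j} | s,a] = <phi(s,a,j), theta*> for every j.
        X = phi[s, a]                                 # (S, d): one feature row per candidate j
        y = np.eye(S)[s_next]                         # one-hot observed next state
        Lambda += X.T @ X
        b += X.T @ y
        s = s_next
theta_hat = np.linalg.solve(Lambda, b)                # ridge estimate of theta*

# ---- Planning phase: a reward is revealed; plan inside the learned model. ----
reward = rng.uniform(size=(S, A))                     # arbitrary reward, given only now
P_hat = np.clip(phi @ theta_hat, 0.0, None)
P_hat /= P_hat.sum(axis=-1, keepdims=True)            # project back onto distributions

def value_iteration(P, r):
    """Finite-horizon DP; returns the greedy policy pi[h, s] and V at step 0."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r + P @ V                                 # Q[s, a] = r(s, a) + sum_s' P(s'|s,a) V(s')
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V

def evaluate(pi, P, r):
    """Exact evaluation of a non-stationary policy pi under kernel P."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        a = pi[h]
        V = r[np.arange(S), a] + P[np.arange(S), a] @ V
    return V

pi_hat, _ = value_iteration(P_hat, reward)            # plan with the learned model
_, V_star = value_iteration(P_true, reward)           # oracle plan, for comparison
print("value of planned policy at s0:", evaluate(pi_hat, P_true, reward)[0])
print("optimal value at s0:          ", V_star[0])
```

As the number of exploration episodes grows, the value of the planned policy at the initial state approaches the optimal value for whichever reward is revealed; the sample-complexity bounds above quantify how many episodes this takes as a function of $H$, $d$, and $\epsilon$.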
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde O(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Nearly Minimax Optimal Reinforcement Learning with Linear Function
Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel, efficient algorithm, LSVI-UCB$+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\left(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A)\right)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning, which interacts with the environment at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes of exploration.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)