On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game
- URL: http://arxiv.org/abs/2110.09771v1
- Date: Tue, 19 Oct 2021 07:26:33 GMT
- Title: On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game
- Authors: Shuang Qiu, Jieping Ye, Zhaoran Wang, Zhuoran Yang
- Abstract summary: We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
- Score: 140.19656665344917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving sample efficiency in reinforcement learning (RL) requires
efficient exploration of the underlying environment. In the offline setting,
addressing the exploration challenge amounts to collecting an offline dataset with
sufficient coverage. Motivated by this challenge, we study the reward-free RL
problem, where an agent aims to thoroughly explore the environment without any
pre-specified reward function. Then, given any extrinsic reward, the agent
computes the policy via a planning algorithm with offline data collected in the
exploration phase. Moreover, we tackle this problem in the context of
function approximation, leveraging powerful function approximators.
Specifically, we propose to explore via an optimistic variant of the
value-iteration algorithm incorporating kernel and neural function
approximations, where we adopt the associated exploration bonus as the
exploration reward. Moreover, we design exploration and planning algorithms for
both single-agent MDPs and zero-sum Markov games and prove that our methods can
achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for
generating an $\varepsilon$-suboptimal policy or an $\varepsilon$-approximate Nash
equilibrium when given an arbitrary extrinsic reward. To the best of our
knowledge, we establish the first provably efficient reward-free RL algorithm
with kernel and neural function approximators.
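
To make the exploration recipe above concrete, the following is a minimal, hypothetical Python sketch of the exploration phase: a kernel-ridge uncertainty bonus over visited state-action features stands in for the extrinsic reward, and the agent acts to maximize that bonus. The environment interface (reset/step/num_actions), the featurize map, the RBF kernel choice, and the greedy one-step use of the bonus (rather than full optimistic value iteration over the horizon, and without the neural variant) are simplifying assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Hypothetical sketch: kernel-ridge uncertainty bonus used as the exploration reward.
# Not the paper's algorithm; an illustration of "bonus as exploration reward" only.

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

def bonus(z, visited, lam=1.0, beta=1.0):
    """Kernel-ridge uncertainty at feature z given visited features:
    beta * sqrt( k(z,z) - k_Z(z)^T (K + lam I)^{-1} k_Z(z) )."""
    if not visited:
        return beta
    K = np.array([[rbf_kernel(a, b) for b in visited] for a in visited])
    k_z = np.array([rbf_kernel(z, a) for a in visited])
    var = rbf_kernel(z, z) - k_z @ np.linalg.solve(K + lam * np.eye(len(visited)), k_z)
    return beta * float(np.sqrt(max(var, 0.0)))

def explore(env, featurize, horizon, num_episodes):
    """Collect a reward-free dataset by acting greedily on the bonus.
    `env` is an assumed interface: reset() -> state, step(a) -> (next_state, done),
    and an integer attribute num_actions; `featurize(s, a)` returns a vector.
    (Recomputing the kernel matrix every step is deliberately unoptimized.)"""
    visited, dataset = [], []
    for _ in range(num_episodes):
        s = env.reset()
        for _ in range(horizon):
            # The exploration "reward" is the uncertainty bonus; no extrinsic reward is used.
            scores = [bonus(featurize(s, a), visited) for a in range(env.num_actions)]
            a = int(np.argmax(scores))
            s_next, done = env.step(a)
            visited.append(featurize(s, a))
            dataset.append((s, a, s_next))
            s = s_next
            if done:
                break
    return dataset
```

In the paper's planning phase, the extrinsic reward revealed afterwards is combined with the offline data collected above; that step is omitted from this sketch.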
Related papers
- Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning [17.239062061431646]
This paper studies reward-agnostic exploration in reinforcement learning (RL).
Consider a finite-horizon inhomogeneous decision process with $S$ states, $A$ actions, and a horizon length $H$.
Our algorithm is able to yield $\varepsilon$-accuracy for arbitrarily many reward functions.
arXiv Detail & Related papers (2023-04-14T17:46:49Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning
with Linear Function Approximation [16.871660060209674]
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting.
We propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2 H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward (a minimal sketch of this redistribution step appears after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest.
In this work, we give both positive and negative results for reward-free RL with linear function approximation.
arXiv Detail & Related papers (2020-06-19T17:59:36Z) - Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework [28.430845498323745]
We consider a reward-free reinforcement learning framework that separates exploration from exploitation.
In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment.
In the planning phase, the agent computes a good policy for any reward function based on the dataset.
arXiv Detail & Related papers (2020-06-11T05:05:31Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
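
As a companion to the PARTED entry above, here is a brief, hypothetical Python sketch of least-squares reward redistribution: fit a linear per-step proxy reward whose sum along each trajectory matches the observed trajectory return. The `featurize` map, the data format, and the ridge regularizer are illustrative assumptions, and the pessimistic value iteration that PARTED runs on the learned proxy reward is omitted.

```python
import numpy as np

def redistribute_rewards(trajectories, featurize, d, lam=1.0):
    """Least-squares reward redistribution (illustrative sketch):
    fit theta so that sum_h phi(s_h, a_h)^T theta matches each trajectory return,
    then use r_hat(s, a) = phi(s, a)^T theta as a per-step proxy reward.
    `trajectories` is a list of (steps, total_return) pairs, where `steps` is a
    list of (s, a) tuples; `featurize(s, a)` returns a d-dimensional vector.
    Both are assumed interfaces, not the paper's API."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for steps, total_return in trajectories:
        # Ridge regression on the summed features of each trajectory.
        summed_phi = sum(featurize(s, a) for s, a in steps)
        A += np.outer(summed_phi, summed_phi)
        b += summed_phi * total_return
    theta = np.linalg.solve(A + lam * np.eye(d), b)
    return lambda s, a: float(featurize(s, a) @ theta)
```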