On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game
- URL: http://arxiv.org/abs/2110.09771v1
- Date: Tue, 19 Oct 2021 07:26:33 GMT
- Title: On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game
- Authors: Shuang Qiu, Jieping Ye, Zhaoran Wang, Zhuoran Yang
- Abstract summary: We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
- Score: 140.19656665344917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving sample efficiency in reinforcement learning (RL) requires
efficient exploration of the underlying environment. In the offline setting,
addressing the exploration challenge amounts to collecting an offline dataset with
sufficient coverage. Motivated by this challenge, we study the reward-free RL
problem, where an agent aims to thoroughly explore the environment without any
pre-specified reward function. Then, given any extrinsic reward, the agent
computes the policy via a planning algorithm with offline data collected in the
exploration phase. Moreover, we tackle this problem in the context of
function approximation, leveraging powerful function approximators.
Specifically, we propose to explore via an optimistic variant of the
value-iteration algorithm incorporating kernel and neural function
approximations, where we adopt the associated exploration bonus as the
exploration reward. Moreover, we design exploration and planning algorithms for
both single-agent MDPs and zero-sum Markov games and prove that our methods can
achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for
generating an $\varepsilon$-suboptimal policy or an $\varepsilon$-approximate Nash
equilibrium when given an arbitrary extrinsic reward. To the best of our
knowledge, we establish the first provably efficient reward-free RL algorithm
with kernel and neural function approximators.
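
To make the exploration recipe above concrete, the following is a minimal, hypothetical Python sketch of the exploration phase: a kernel-ridge uncertainty bonus over visited state-action features stands in for the extrinsic reward, and the agent acts to maximize that bonus. The environment interface (reset/step/num_actions), the featurize map, the RBF kernel choice, and the greedy one-step use of the bonus (rather than full optimistic value iteration over the horizon, and without the neural variant) are simplifying assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Hypothetical sketch: kernel-ridge uncertainty bonus used as the exploration reward.
# Not the paper's algorithm; an illustration of "bonus as exploration reward" only.

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

def bonus(z, visited, lam=1.0, beta=1.0):
    """Kernel-ridge uncertainty at feature z given visited features:
    beta * sqrt( k(z,z) - k_Z(z)^T (K + lam I)^{-1} k_Z(z) )."""
    if not visited:
        return beta
    K = np.array([[rbf_kernel(a, b) for b in visited] for a in visited])
    k_z = np.array([rbf_kernel(z, a) for a in visited])
    var = rbf_kernel(z, z) - k_z @ np.linalg.solve(K + lam * np.eye(len(visited)), k_z)
    return beta * float(np.sqrt(max(var, 0.0)))

def explore(env, featurize, horizon, num_episodes):
    """Collect a reward-free dataset by acting greedily on the bonus.
    `env` is an assumed interface: reset() -> state, step(a) -> (next_state, done),
    and an integer attribute num_actions; `featurize(s, a)` returns a vector.
    (Recomputing the kernel matrix every step is deliberately unoptimized.)"""
    visited, dataset = [], []
    for _ in range(num_episodes):
        s = env.reset()
        for _ in range(horizon):
            # The exploration "reward" is the uncertainty bonus; no extrinsic reward is used.
            scores = [bonus(featurize(s, a), visited) for a in range(env.num_actions)]
            a = int(np.argmax(scores))
            s_next, done = env.step(a)
            visited.append(featurize(s, a))
            dataset.append((s, a, s_next))
            s = s_next
            if done:
                break
    return dataset
```

In the paper's planning phase, the extrinsic reward revealed afterwards is combined with the offline data collected above; that step is omitted from this sketch.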
Related papers
- Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning [17.239062061431646]
This paper studies reward-agnostic exploration in reinforcement learning (RL).
Consider a finite-horizon inhomogeneous decision process with $S$ states, $A$ actions, and a horizon length $H$.
Our algorithm is able to yield $\varepsilon$-accuracy for arbitrarily many reward functions.
arXiv Detail & Related papers (2023-04-14T17:46:49Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning
with Linear Function Approximation [16.871660060209674]
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting.
We propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2 H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward (a minimal sketch of this redistribution step appears after this list).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest.
In this work, we give both positive and negative results for reward-free RL with linear function approximation.
arXiv Detail & Related papers (2020-06-19T17:59:36Z) - Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework [28.430845498323745]
We consider a reward-free reinforcement learning framework that separates exploration from exploitation.
In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment.
In the planning phase, the agent computes a good policy for any reward function based on the dataset.
arXiv Detail & Related papers (2020-06-11T05:05:31Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
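
As a companion to the PARTED entry above, here is a brief, hypothetical Python sketch of least-squares reward redistribution: fit a linear per-step proxy reward whose sum along each trajectory matches the observed trajectory return. The `featurize` map, the data format, and the ridge regularizer are illustrative assumptions, and the pessimistic value iteration that PARTED runs on the learned proxy reward is omitted.

```python
import numpy as np

def redistribute_rewards(trajectories, featurize, d, lam=1.0):
    """Least-squares reward redistribution (illustrative sketch):
    fit theta so that sum_h phi(s_h, a_h)^T theta matches each trajectory return,
    then use r_hat(s, a) = phi(s, a)^T theta as a per-step proxy reward.
    `trajectories` is a list of (steps, total_return) pairs, where `steps` is a
    list of (s, a) tuples; `featurize(s, a)` returns a d-dimensional vector.
    Both are assumed interfaces, not the paper's API."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for steps, total_return in trajectories:
        # Ridge regression on the summed features of each trajectory.
        summed_phi = sum(featurize(s, a) for s, a in steps)
        A += np.outer(summed_phi, summed_phi)
        b += summed_phi * total_return
    theta = np.linalg.solve(A + lam * np.eye(d), b)
    return lambda s, a: float(featurize(s, a) @ theta)
```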