Task-agnostic Exploration in Reinforcement Learning
- URL: http://arxiv.org/abs/2006.09497v1
- Date: Tue, 16 Jun 2020 20:23:41 GMT
- Title: Task-agnostic Exploration in Reinforcement Learning
- Authors: Xuezhou Zhang, Yuzhe Ma, Adish Singla
- Abstract summary: We present an efficient task-agnostic reinforcement learning algorithm, \textsc{UCBZero}.
It finds $\epsilon$-optimal policies for $N$ arbitrary tasks after at most $\tilde O(\log(N)H^5SA/\epsilon^2)$ exploration episodes.
We also provide an $\Omega(\log(N)H^2SA/\epsilon^2)$ lower bound, showing that the $\log$ dependency on $N$ is unavoidable.
- Score: 35.403304641170386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient exploration is one of the main challenges in reinforcement learning
(RL). Most existing sample-efficient algorithms assume the existence of a
single reward function during exploration. In many practical scenarios,
however, there is not a single underlying reward function to guide the
exploration, for instance, when an agent needs to learn many skills
simultaneously, or multiple conflicting objectives need to be balanced. To
address these challenges, we propose the \textit{task-agnostic RL} framework:
In the exploration phase, the agent first collects trajectories by exploring
the MDP without the guidance of a reward function. After exploration, it aims
at finding near-optimal policies for $N$ tasks, given the collected
trajectories augmented with \textit{sampled rewards} for each task. We present
an efficient task-agnostic RL algorithm, \textsc{UCBZero}, that finds
$\epsilon$-optimal policies for $N$ arbitrary tasks after at most $\tilde
O(\log(N)H^5SA/\epsilon^2)$ exploration episodes. We also provide an
$\Omega(\log (N)H^2SA/\epsilon^2)$ lower bound, showing that the $\log$
dependency on $N$ is unavoidable. Furthermore, we provide an $N$-independent
sample complexity bound of \textsc{UCBZero} in the statistically easier setting
when the ground truth reward functions are known.
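The two-phase framework described in the abstract (reward-free exploration first, then per-task planning from the logged data) can be sketched on a toy tabular MDP. This is a minimal illustration under simplifying assumptions, not the paper's UCBZero: the bonus shape, the empirical-model planner, and all variable names are hypothetical.

```python
import numpy as np

# Toy sketch of task-agnostic RL: explore with NO reward signal, using only a
# count-based optimism bonus; then reuse the same data to plan for N task rewards.

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5                            # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))   # true transitions (hidden from the agent)

counts = np.zeros((S, A, S))                 # transition counts from exploration

def explore(episodes=2000):
    """Exploration phase: rewards are ignored; a UCB-style bonus drives coverage."""
    for _ in range(episodes):
        s = 0
        for _h in range(H):
            n = counts[s].sum(axis=1)                 # visit counts per action
            bonus = 1.0 / np.sqrt(np.maximum(n, 1))   # optimism for rarely tried actions
            a = int(np.argmax(bonus + 1e-9 * rng.random(A)))  # tiny noise breaks ties
            s2 = int(rng.choice(S, p=P[s, a]))
            counts[s, a, s2] += 1
            s = s2

def plan(reward):
    """Planning phase: backward value iteration on the empirical model for one task."""
    n = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward + P_hat @ V                        # shape (S, A)
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V

explore()
# The same exploration data serves N different task rewards:
for task_reward in [rng.random((S, A)) for _ in range(3)]:
    pi, V = plan(task_reward)
    print(V[0])
```

The key point the sketch mirrors is that exploration happens once, with no reward, and the collected counts are then amortized across arbitrarily many downstream tasks.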
Related papers
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE matches or outperforms state-of-the-art unsupervised RL algorithms.
arXiv Detail & Related papers (2024-06-24T01:37:18Z) - Provably Feedback-Efficient Reinforcement Learning via Active Reward
Learning [26.067411894141863]
An appropriate reward function is of paramount importance in specifying a task in reinforcement learning (RL).
Human-in-the-loop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback.
We provide an active-learning-based RL algorithm that first explores the environment without specifying a reward function.
arXiv Detail & Related papers (2023-04-18T12:36:09Z) - Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning [17.239062061431646]
This paper studies reward-agnostic exploration in reinforcement learning (RL).
Consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and a horizon length $H$.
Our algorithm is able to yield $\varepsilon$-accuracy for arbitrarily many reward functions.
arXiv Detail & Related papers (2023-04-14T17:46:49Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that, to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde O(H^4d(H + d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\big(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A)\big)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H > 1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, \textbf{S}taged \textbf{S}ampling + \textbf{T}runcated \textbf{P}lanning (SSTP), which interacts with the environment at most $O\big(\frac{S^2A}{\epsilon^2}\,\mathrm{polylog}\big(\frac{SAH}{\epsilon^2}\big)\big)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - A Provably Efficient Sample Collection Strategy for Reinforcement
Learning [123.69175280309226]
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior.
We propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., sparse simulator of the environment); 2) An "objective-agnostic" sample collection responsible for generating the prescribed samples as fast as possible.
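The decoupled approach summarized above can be sketched in a few lines. This is a hypothetical toy, not the paper's algorithm: the per-(state, action) budgets, the greedy "reduce the largest deficit" collector, and all names are simplifying assumptions.

```python
import numpy as np

# Sketch of a decoupled sample-collection loop: step 1 prescribes per-(state,
# action) sample budgets as if a generative model were available; step 2 is an
# objective-agnostic online collector that acts to shrink the largest deficit.

rng = np.random.default_rng(1)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))  # true dynamics, unknown to the collector

budget = np.full((S, A), 20)                # step 1: "objective-specific" prescription
collected = np.zeros((S, A))

s = 0
while (collected < budget).any():           # step 2: "objective-agnostic" collection
    deficit = budget - collected
    a = int(deficit[s].argmax())            # act where the current state needs samples
    collected[s, a] += 1
    s = int(rng.choice(S, p=P[s, a]))
```

Because the transition matrix here has full support, the random walk visits every state repeatedly, so every budget entry is eventually met; real collectors would instead navigate deliberately to under-sampled states.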
arXiv Detail & Related papers (2020-07-13T15:17:35Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.