Gap-Dependent Unsupervised Exploration for Reinforcement Learning
- URL: http://arxiv.org/abs/2108.05439v1
- Date: Wed, 11 Aug 2021 20:42:46 GMT
- Title: Gap-Dependent Unsupervised Exploration for Reinforcement Learning
- Authors: Jingfeng Wu, Vladimir Braverman, Lin F. Yang
- Abstract summary: We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}(1/\epsilon \cdot (H^3SA/\rho + H^4 S^2 A))$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$.
- Score: 40.990467706237396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For the problem of task-agnostic reinforcement learning (RL), an agent first
collects samples from an unknown environment without the supervision of reward
signals, and is then given a reward function and asked to compute a corresponding
near-optimal policy. Existing approaches mainly concern the worst-case
scenarios, in which no structural information of the reward/transition-dynamics
is utilized. Therefore the best sample upper bound is
$\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$, where $\epsilon>0$ is the
target accuracy of the obtained policy, and can be overly pessimistic. To
tackle this issue, we provide an efficient algorithm that utilizes a gap
parameter, $\rho>0$, to reduce the amount of exploration. In particular, for an
unknown finite-horizon Markov decision process, the algorithm takes only
$\widetilde{\mathcal{O}} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )$
episodes of exploration, and is able to obtain an $\epsilon$-optimal policy for
a post-revealed reward with sub-optimality gap at least $\rho$, where $S$ is
the number of states, $A$ is the number of actions, and $H$ is the length of
the horizon, obtaining a nearly \emph{quadratic saving} in terms of $\epsilon$.
We show that, information-theoretically, this bound is nearly tight for $\rho <
\Theta(1/(HS))$ and $H>1$. We further show that
a $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e.,
multi-armed bandit) or with a sampling simulator, establishing a stark
separation between those settings and the RL setting.
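As a concrete illustration of the protocol in the abstract: during exploration the agent never observes rewards, and only after a reward is revealed does it plan. In the standard gap-dependent formulation, the parameter $\rho$ lower-bounds the positive sub-optimality gaps $\mathrm{gap}_h(s,a) = V^*_h(s) - Q^*_h(s,a)$ of the revealed reward. The Python sketch below shows only this two-phase protocol on a small tabular MDP; it is not the paper's gap-dependent algorithm. The uniform-random exploration policy, the dynamics P_true, the dimensions S, A, H, and the revealed reward r are illustrative assumptions.

# Minimal sketch of the two-phase task-agnostic RL protocol:
# (1) reward-free exploration, (2) planning once a reward is revealed.
# The uniform-random exploration is a placeholder, not the paper's
# gap-dependent exploration scheme; all quantities below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 4                                # states, actions, horizon

# Unknown ground-truth dynamics: one distribution over next states per (s, a).
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)

def step(s, a):
    """Sample a next state from the (unknown) environment."""
    return rng.choice(S, p=P_true[s, a])

# ---- Phase 1: reward-free exploration (placeholder policy) ----
counts = np.zeros((S, A, S))
for _ in range(20_000):                          # exploration episodes
    s = rng.integers(S)
    for _ in range(H):
        a = rng.integers(A)                      # uniform-random action
        s_next = step(s, a)
        counts[s, a, s_next] += 1
        s = s_next

# Empirical transition model built from reward-free data only.
visits = counts.sum(axis=2, keepdims=True)
P_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / S)

# ---- Phase 2: a reward is revealed; plan against the learned model ----
r = rng.uniform(size=(S, A))                     # post-revealed reward (illustrative)
V = np.zeros((H + 1, S))                         # V[h, s]; V[H] = 0
pi = np.zeros((H, S), dtype=int)
for h in range(H - 1, -1, -1):                   # backward value iteration
    Q = r + P_hat @ V[h + 1]                     # shape (S, A)
    pi[h] = Q.argmax(axis=1)
    V[h] = Q.max(axis=1)

print("Greedy policy per step:\n", pi)
print("Estimated values of the initial states:", np.round(V[0], 3))

The quadratic saving in $\epsilon$ reported above comes from replacing the placeholder exploration with a scheme that exploits the gap parameter $\rho$, which this sketch does not attempt.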
Related papers
- Multiple-policy Evaluation via Density Estimation [30.914344538340412]
We propose an algorithm named $\mathrm{CAESAR}$ for this problem.
Up to low-order and logarithmic terms, $\mathrm{CAESAR}$ achieves a sample complexity of $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^{H}\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d_h^{\pi^k}$ is the state-action visitation distribution of the target policy $\pi^k$ at step $h$.
arXiv Detail & Related papers (2024-03-29T23:55:25Z) - Estimation and Inference in Distributional Reinforcement Learning [28.253677740976197]
We show that a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^2(1-\gamma)^4}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\epsilon$ with high probability.
Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $\eta^\pi$.
arXiv Detail & Related papers (2023-09-29T14:14:53Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Layered State Discovery for Incremental Autonomous Exploration [106.37656068276901]
Layered Autonomous Exploration (LAE) is a novel algorithm for AX that attains a sample complexity of $\tilde{\mathcal{O}}(LS^{\rightarrow}_{L}A\ln^{12}(S^{\rightarrow}_{L}))$.
arXiv Detail & Related papers (2023-02-07T22:58:12Z) - Sampling from Log-Concave Distributions with Infinity-Distance
Guarantees and Applications to Differentially Private Optimization [33.38289436686841]
We present an algorithm that outputs a point from a distribution that is $O(\varepsilon)$-close to $\pi$ in infinity-distance.
We also present a "soft-$\pi$" version of the Dikin walk, which may be of independent interest.
arXiv Detail & Related papers (2021-11-07T13:44:50Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H + d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning, which interacts with the environment for at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an improved algorithm with a sharper sample complexity bound.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)