Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs
- URL: http://arxiv.org/abs/2012.14755v1
- Date: Tue, 29 Dec 2020 14:06:09 GMT
- Title: Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs
- Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
- Abstract summary: We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
- Score: 132.88757893161699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the exploration of an unknown environment when no reward
function is provided. Building on the incremental exploration setting
introduced by Lim and Auer [1], we define the objective of learning the set of
$\epsilon$-optimal goal-conditioned policies attaining all states that are
incrementally reachable within $L$ steps (in expectation) from a reference
state $s_0$. In this paper, we introduce a novel model-based approach that
interleaves discovering new states from $s_0$ and improving the accuracy of a
model estimate that is used to compute goal-conditioned policies to reach newly
discovered states. The resulting algorithm, DisCo, achieves a sample complexity
scaling as $\tilde{O}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$,
where $A$ is the number of actions, $S_{L+\epsilon}$ is the number of states
that are incrementally reachable from $s_0$ in $L+\epsilon$ steps, and
$\Gamma_{L+\epsilon}$ is the branching factor of the dynamics over such states.
This improves over the algorithm proposed in [1] in both $\epsilon$ and $L$ at
the cost of an extra $\Gamma_{L+\epsilon}$ factor, which is small in most
environments of interest. Furthermore, DisCo is the first algorithm that can
return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive
shortest-path problem defined on the $L$-reachable states with minimum cost
$c_{\min}$. Finally, we report preliminary empirical results confirming our
theoretical findings.
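As a rough illustration of the interleaving described in the abstract, the following minimal sketch alternates between refining an empirical transition model on the states discovered so far and testing whether newly observed states look reachable from $s_0$ within roughly $L$ steps. It is not the authors' implementation: the `env` interface, the fixed per-pair sampling budget, and the BFS-based reachability proxy are all assumptions made for this example.

```python
from collections import deque


def hop_distance(counts, s0, goal):
    """Breadth-first search over the support of the empirical model.

    Crude stand-in for solving the goal-conditioned shortest-path
    problem on the estimated MDP (which the actual algorithm does).
    """
    dist, queue = {s0: 0}, deque([s0])
    while queue:
        s = queue.popleft()
        for (s_from, _a), successors in counts.items():
            if s_from != s:
                continue
            for s_next in successors:
                if s_next not in dist:
                    dist[s_next] = dist[s] + 1
                    queue.append(s_next)
    return dist.get(goal, float("inf"))


def explore_sketch(env, s0, L, epsilon, n_samples_per_pair=200):
    """Alternate between model refinement and state discovery.

    Hypothetical `env` interface assumed for this sketch:
      env.actions       -- iterable of actions
      env.sample(s, a)  -- draws a next state from the true dynamics
    """
    known = {s0}    # states currently believed to be L-reachable from s0
    counts = {}     # (state, action) -> {next_state: visit count}
    while True:
        # 1) Model refinement: collect transitions from every known state-action pair.
        frontier = set()
        for s in list(known):
            for a in env.actions:
                c = counts.setdefault((s, a), {})
                for _ in range(n_samples_per_pair):
                    s_next = env.sample(s, a)
                    c[s_next] = c.get(s_next, 0) + 1
                    frontier.add(s_next)
        # 2) Discovery: keep candidates that the empirical model suggests are
        #    reachable from s0 within roughly L (plus slack epsilon) steps.
        new_states = {s for s in frontier - known
                      if hop_distance(counts, s0, s) <= L + epsilon}
        if not new_states:
            return known, counts
        known |= new_states
```

In the paper, the reachability test and the goal-conditioned policies used to collect samples rely on confidence intervals over the estimated dynamics, which is what yields the $\tilde{O}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$ guarantee; the sketch above omits all of that machinery.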
Related papers
- Finding good policies in average-reward Markov Decision Processes without prior knowledge [19.89784209009327]
We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs).
By relying on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification.
arXiv Detail & Related papers (2024-05-27T12:24:14Z) - Layered State Discovery for Incremental Autonomous Exploration [106.37656068276901]
Layered Autonomous Exploration (LAE) is a novel algorithm for AX that attains a sample complexity of $\tilde{\mathcal{O}}(L S^{\rightarrow}_{L} A \ln^{12}(S^{\rightarrow}_{L}) / \epsilon^2)$.
arXiv Detail & Related papers (2023-02-07T22:58:12Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\big(1/\epsilon \cdot (H^3 S A / \rho + H^4 S^2 A)\big)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho \geq \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation [30.137884459159107]
We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces.
We develop a simple, iterative learning algorithm that finds an $\epsilon$-Schmidt $Q$-function with sample complexity of $\widetilde{O}\big(\frac{1}{\epsilon^{\max(d_1, d_2)+2}}\big)$ when the optimal $Q$-function has low rank $r$ and the discounting factor $\gamma$ is below a certain threshold.
arXiv Detail & Related papers (2020-06-11T00:55:35Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ trajectories.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.