Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal
Stochastic Shortest Path
- URL: http://arxiv.org/abs/2205.10729v1
- Date: Sun, 22 May 2022 03:54:15 GMT
- Title: Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal
Stochastic Shortest Path
- Authors: Haoyuan Cai, Tengyu Ma, Simon Du
- Abstract summary: We revisit the incremental autonomous exploration problem proposed by Lim & Auer (2012).
In this setting, the agent aims to learn a set of near-optimal goal-conditioned policies to reach the $L$-controllable states.
We introduce a new algorithm with stronger sample complexity bounds than existing ones.
- Score: 26.27529098205787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit the incremental autonomous exploration problem proposed by Lim &
Auer (2012). In this setting, the agent aims to learn a set of near-optimal
goal-conditioned policies to reach the $L$-controllable states: states that are
incrementally reachable from an initial state $s_0$ within $L$ steps in
expectation. We introduce a new algorithm with stronger sample complexity
bounds than existing ones. Furthermore, we also prove the first lower bound for
the autonomous exploration problem. In particular, the lower bound implies that
our proposed algorithm, Value-Aware Autonomous Exploration, is nearly
minimax-optimal when the number of $L$-controllable states grows polynomially
with respect to $L$. Key in our algorithm design is a connection between
autonomous exploration and multi-goal stochastic shortest path, a new problem
that naturally generalizes the classical stochastic shortest path problem. This
new problem and its connection to autonomous exploration can be of independent
interest.
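
For context, here is a minimal LaTeX sketch of the incremental $L$-controllability definition following Lim & Auer (2012); the symbols $\mathcal{S}_L^{\rightarrow}$ and $\tau_\pi$ are illustrative and may differ from the paper's exact notation.

```latex
% Sketch of the incremental L-controllability definition (after Lim & Auer,
% 2012). Notation is illustrative and may not match the paper's symbols.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Let $\tau_\pi(s_0 \to s)$ denote the expected number of steps a policy $\pi$
needs to first reach $s$ from $s_0$. The incrementally $L$-controllable
states $\mathcal{S}_L^{\rightarrow}$ are defined inductively: $s$ belongs to
$\mathcal{S}_L^{\rightarrow}$ if some policy $\pi$ restricted to states
already in $\mathcal{S}_L^{\rightarrow}$ satisfies
\begin{equation*}
  \tau_\pi(s_0 \to s) \le L .
\end{equation*}
Autonomous exploration then asks for a goal-conditioned policy $\pi_s$ for
every $s \in \mathcal{S}_L^{\rightarrow}$ with
$\tau_{\pi_s}(s_0 \to s) \le (1+\epsilon)L$, i.e.\ near-optimal up to a
multiplicative slack $\epsilon$.
\end{document}
```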
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- Layered State Discovery for Incremental Autonomous Exploration [106.37656068276901]
Layered Autonomous Exploration (LAE) is a novel algorithm for AX that attains a sample complexity of $\tilde{\mathcal{O}}(L S^{\rightarrow}_{L(1+\epsilon)} \Gamma_{L(1+\epsilon)} A \ln^{12}(S^{\rightarrow}_{L(1+\epsilon)})/\epsilon^2)$, where $S^{\rightarrow}_{L(1+\epsilon)}$ counts the states incrementally reachable from $s_0$ within $L(1+\epsilon)$ expected steps.
arXiv Detail & Related papers (2023-02-07T22:58:12Z)
- Adaptive Multi-Goal Exploration [118.40427257364729]
We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy.
AdaGoal is anchored in the high-level algorithmic structure of existing methods for goal-conditioned deep reinforcement learning.
arXiv Detail & Related papers (2021-11-23T17:59:50Z)
- Improved Sample Complexity for Incremental Autonomous Exploration in MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies reaching all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z)
- Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
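
To make the reward-free template above concrete, here is a minimal, hypothetical Python sketch of its two phases on a toy tabular MDP. The uniform-random exploration policy and all names are illustrative stand-ins; the paper's algorithm uses a far more careful exploration scheme to achieve its $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ bound.

```python
import numpy as np

# Toy illustration of the reward-free RL template: an exploration phase that
# collects transitions without any reward signal, followed by a planning phase
# that, given an arbitrary reward function, computes a policy from the learned
# model. The uniform-random exploration policy is only a stand-in for the
# paper's exploration algorithm; the toy MDP and all names are illustrative.

rng = np.random.default_rng(0)
S, A, H = 5, 2, 10  # states, actions, horizon

# Ground-truth transition kernel of a random toy MDP (unknown to the agent).
P_true = rng.dirichlet(np.ones(S), size=(S, A))

def explore(num_episodes):
    """Exploration phase: gather transition counts with no reward signal."""
    counts = np.zeros((S, A, S))
    for _ in range(num_episodes):
        s = 0
        for _ in range(H):
            a = rng.integers(A)  # stand-in exploration policy
            s_next = rng.choice(S, p=P_true[s, a])
            counts[s, a, s_next] += 1
            s = s_next
    return counts

def plan(counts, reward):
    """Planning phase: value iteration on the empirical model, any reward."""
    n = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)
    V = np.zeros(S)
    for _ in range(H):          # finite-horizon backward induction
        Q = reward + P_hat @ V  # shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

counts = explore(num_episodes=2000)
reward = rng.random((S, A))  # reward revealed only after exploration
policy, V = plan(counts, reward)
print("greedy policy:", policy, "value at s0: %.3f" % V[0])
```

The point of the template is that `explore` never sees a reward, yet `plan` can be re-run for any reward function afterwards, which is exactly the separation the reward-free framework is designed to isolate.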