Fast Rates for Maximum Entropy Exploration
- URL: http://arxiv.org/abs/2303.08059v2
- Date: Tue, 6 Jun 2023 13:35:42 GMT
- Title: Fast Rates for Maximum Entropy Exploration
- Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines,
Remi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, Pierre
Menard
- Abstract summary: We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment with sparse or no rewards.
We study the maximum entropy exploration problem of two different types.
For visitation entropy, we propose a game-theoretic algorithm that has $\widetilde{\mathcal{O}}(H^3S^2A/\varepsilon^2)$ sample complexity.
For the trajectory entropy, we propose a simple algorithm that has a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{poly}(S,A,H)/\varepsilon)$.
- Score: 52.946307632704645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the challenge of exploration in reinforcement learning (RL) when
the agent operates in an unknown environment with sparse or no rewards. In this
work, we study the maximum entropy exploration problem of two different types.
The first type is visitation entropy maximization, previously considered by
Hazan et al. (2019) in the discounted setting. For this type of exploration, we
propose a game-theoretic algorithm that has
$\widetilde{\mathcal{O}}(H^3S^2A/\varepsilon^2)$ sample complexity, thus
improving the $\varepsilon$-dependence over existing results, where $S$ is the
number of states, $A$ is the number of actions, $H$ is the episode length, and
$\varepsilon$ is the desired accuracy. The second type of entropy we study is
the trajectory entropy. This objective function is closely related to
entropy-regularized MDPs, and we propose a simple algorithm that has a sample
complexity of order
$\widetilde{\mathcal{O}}(\mathrm{poly}(S,A,H)/\varepsilon)$. Interestingly, it
is the first theoretical result in the RL literature that establishes the
potential statistical advantage of regularized MDPs for exploration. Finally,
we apply the developed regularization techniques to reduce the sample
complexity of visitation entropy maximization to
$\widetilde{\mathcal{O}}(H^2SA/\varepsilon^2)$,
yielding a statistical separation between maximum entropy exploration and
reward-free exploration.
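
To make the visitation-entropy objective concrete, here is a minimal sketch (in Python/NumPy) that evaluates one common formalization of it: the Shannon entropy of a policy's average per-step state visitation distribution in a tabular finite-horizon MDP with known transitions. The function name, array layout, and this exact formalization are illustrative assumptions; the snippet only evaluates the objective for a fixed policy and is not the paper's game-theoretic maximization algorithm, which must also learn the transitions from samples.

```python
# Hedged sketch: evaluate the visitation-entropy objective for a FIXED policy
# in a tabular finite-horizon MDP with KNOWN transitions (illustration only;
# not the paper's algorithm, which maximizes this quantity from interaction).
import numpy as np

def visitation_entropy(P, pi, mu0, H):
    """Shannon entropy of the average per-step state visitation distribution.

    P   : transition kernel, shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
    pi  : policy, shape (H, S, A), pi[h, s, a] = Pr(a | s) at step h
    mu0 : initial state distribution, shape (S,)
    H   : episode length
    """
    S = mu0.shape[0]
    d = mu0.copy()               # state distribution at the current step
    avg = np.zeros(S)            # running average of per-step distributions
    for h in range(H):
        avg += d / H
        # d_{h+1}(s') = sum_{s, a} d_h(s) * pi_h(a | s) * P(s' | s, a)
        sa = d[:, None] * pi[h]              # joint (state, action) distribution
        d = np.einsum('sa,sap->p', sa, P)    # push forward through the dynamics
    p = avg[avg > 0]                         # drop zero-mass states before log
    return -np.sum(p * np.log(p))

# Tiny usage example: random 3-state, 2-action MDP and a uniform policy.
rng = np.random.default_rng(0)
S, A, H = 3, 2, 5
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)           # normalize to a valid kernel
pi = np.full((H, S, A), 1.0 / A)
mu0 = np.full(S, 1.0 / S)
print(visitation_entropy(P, pi, mu0, H))
```

A uniform policy is generally not the maximizer of this objective; the paper's contribution is finding an $\varepsilon$-approximate maximizer from interaction alone, with the sample complexities quoted above.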
Related papers
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs)
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Sublinear quantum algorithms for estimating von Neumann entropy [18.30551855632791]
We study the problem of obtaining estimates to within a multiplicative factor $\gamma>1$ of the Shannon entropy of probability distributions and the von Neumann entropy of mixed quantum states.
We work with the quantum purified query access model, which can handle both classical probability distributions and mixed quantum states, and is the most general input model considered in the literature.
arXiv Detail & Related papers (2021-11-22T12:00:45Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H + d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}(1/\epsilon \cdot (H^3SA/\rho + H^4 S^2 A))$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho \geq \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z) - Adaptive Reward-Free Exploration [48.98199700043158]
We show that our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994.
We further investigate the relative complexities of reward-free exploration and best-policy identification.
arXiv Detail & Related papers (2020-06-11T09:58:03Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.