Curious Explorer: a provable exploration strategy in Policy Learning
- URL: http://arxiv.org/abs/2106.15503v1
- Date: Tue, 29 Jun 2021 15:31:51 GMT
- Title: Curious Explorer: a provable exploration strategy in Policy Learning
- Authors: Marco Miani, Maurizio Parton, Marco Romito
- Abstract summary: We develop Curious Explorer, a novel and simple iterative state space exploration strategy.
Curious Explorer starts from $\rho$, then uses intrinsic rewards assigned to the set of poorly visited states to produce a sequence of policies.
We show that Curious Explorer can improve performance in MDPs with challenging exploration.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Having access to an exploring restart distribution (the so-called wide
coverage assumption) is critical with policy gradient methods. This is due to
the fact that, while the objective function is insensitive to updates in
unlikely states, the agent may still need improvements in those states in order
to reach a nearly optimal payoff. For this reason, wide coverage is used in
some form when analyzing theoretical properties of practical policy gradient
methods. However, this assumption can be unfeasible in certain environments,
for instance when learning is online, or when restarts are possible only from a
fixed initial state. In these cases, classical policy gradient algorithms can
have very poor convergence properties and sample efficiency. In this paper, we
develop Curious Explorer, a novel and simple iterative state space exploration
strategy that can be used with any starting distribution $\rho$. Curious
Explorer starts from $\rho$, then using intrinsic rewards assigned to the set
of poorly visited states produces a sequence of policies, each one more
exploratory than the previous one in an informed way, and finally outputs a
restart model $\mu$ based on the state visitation distribution of the
exploratory policies. Curious Explorer is provable, in the sense that we
provide theoretical upper bounds on how often an optimal policy visits poorly
visited states. These bounds can be used to prove PAC convergence and sample
efficiency results when a PAC optimizer is plugged into Curious Explorer. This
makes it possible to achieve global convergence and sample efficiency results without any
coverage assumption for REINFORCE, and potentially for any other policy
gradient method ensuring PAC convergence with wide coverage. Finally, we plug
(the output of) Curious Explorer into REINFORCE and TRPO, and show empirically
that it can improve performance in MDPs with challenging exploration.
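To make the iterative scheme described in the abstract concrete, below is a minimal sketch of the outer loop for a tabular MDP. The environment interface (`n_states`, `reset`, `step`), the binary intrinsic-reward form, the visit-count threshold, and the uniform mixing of visitation distributions are all illustrative assumptions; the paper defines its own intrinsic rewards, its informed schedule of increasingly exploratory policies, and the precise restart model $\mu$.

```python
import numpy as np

def curious_explorer(env, rho, policy_optimizer, n_rounds=10, n_rollouts=100,
                     horizon=100, visit_threshold=5):
    """Sketch of a Curious-Explorer-style outer loop (assumed interfaces).

    `env` is assumed to expose `n_states`, `reset(state)` and `step(action)`
    (returning next_state, reward, done); `policy_optimizer` is any (PAC)
    policy optimizer trained on the given reward with the given restarts.
    """
    n_states = env.n_states
    visit_counts = np.zeros(n_states)
    visitation_mixture = np.zeros(n_states)

    for _ in range(n_rounds):
        # Intrinsic reward: +1 on poorly visited states, 0 elsewhere (assumed form).
        intrinsic_reward = (visit_counts < visit_threshold).astype(float)

        # Train a policy that seeks the poorly visited states, restarting from rho.
        # (The paper makes each policy more exploratory than the last in an
        # informed way; that mechanism is not reproduced here.)
        policy = policy_optimizer(env, restart_dist=rho, reward=intrinsic_reward)

        # Estimate the state visitation distribution of the new exploratory policy.
        visitation = np.zeros(n_states)
        for _ in range(n_rollouts):
            s = env.reset(np.random.choice(n_states, p=rho))
            for _ in range(horizon):
                visitation[s] += 1
                visit_counts[s] += 1
                s, _, done = env.step(policy(s))
                if done:
                    break
        visitation_mixture += visitation / visitation.sum()

    # Output: restart model mu built from the exploratory visitation distributions.
    mu = visitation_mixture / visitation_mixture.sum()
    return mu
```

The returned $\mu$ would then serve as the restart distribution for a downstream policy gradient method such as REINFORCE or TRPO, as in the experiments mentioned above.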
Related papers
- Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value (a minimal sketch appears after this list).
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies.
arXiv Detail & Related papers (2024-05-27T01:08:23Z) - Maximum State Entropy Exploration using Predecessor and Successor
Representations [17.732962106114478]
Animals have a developed ability to explore that aids them in important tasks such as locating food.
We propose $\eta\psi$-Learning, a method to learn efficient exploratory policies by conditioning on past episodic experience.
arXiv Detail & Related papers (2023-06-26T16:08:26Z) - A State-Distribution Matching Approach to Non-Episodic Reinforcement
Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z) - Explicit Explore, Exploit, or Escape ($E^4$): near-optimal
safety-constrained reinforcement learning in polynomial time [0.0]
Constrained Markov decision processes (CMDPs) can provide long-term safety constraints.
This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape.
$E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement.
arXiv Detail & Related papers (2021-11-14T17:43:25Z) - Reannealing of Decaying Exploration Based On Heuristic Measure in Deep
Q-Network [82.20059754270302]
We propose an algorithm based on the idea of reannealing, which aims to encourage exploration only when it is needed.
We perform an illustrative case study showing that it has potential to both accelerate training and obtain a better policy.
arXiv Detail & Related papers (2020-09-29T20:40:00Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value
Iteration [143.43658264904863]
We show how value iteration, under a more standard notion of low inherent Bellman error (typically employed in least-squares value-iteration-style algorithms), can provide strong PAC guarantees on learning a near optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z) - PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient
Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
arXiv Detail & Related papers (2020-07-16T16:57:41Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO).
OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
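As referenced in the Oracle-Efficient Reinforcement Learning for Max Value Ensembles entry above, here is a minimal sketch of the max-following policy itself, under assumed interfaces (constituent policies and their value estimates exposed as callables). It illustrates only the policy being competed with, not that paper's learning algorithm.

```python
from typing import Callable, Sequence

def max_following_action(state,
                         policies: Sequence[Callable],
                         value_estimates: Sequence[Callable]):
    """At `state`, act with the constituent policy whose estimated value is highest.

    `policies[k](state)` is assumed to return an action and
    `value_estimates[k](state)` a scalar estimate of V_k(state);
    both interfaces are assumptions for this sketch.
    """
    best_k = max(range(len(policies)), key=lambda k: value_estimates[k](state))
    return policies[best_k](state)
```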