Discovering and Exploiting Sparse Rewards in a Learned Behavior Space
- URL: http://arxiv.org/abs/2111.01919v2
- Date: Tue, 26 Sep 2023 21:42:27 GMT
- Title: Discovering and Exploiting Sparse Rewards in a Learned Behavior Space
- Authors: Giuseppe Paolo, Miranda Coninx, Alban Laflaquière, and Stephane Doncieux
- Abstract summary: Learning optimal policies in sparse reward settings is difficult, as the learning agent has little to no feedback on the quality of its actions.
We introduce STAX, an algorithm designed to learn a behavior space on-the-fly and to explore it while efficiently optimizing any reward discovered.
- Score: 0.46736439782713946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning optimal policies in sparse reward settings is difficult
because the learning agent has little to no feedback on the quality of its
actions. In these situations, a good strategy is to focus on exploration,
hopefully leading to the discovery of a reward signal to improve on. A
learning algorithm capable of dealing with this kind of setting has to be
able to (1) explore possible agent behaviors and (2) exploit any reward it
discovers. Efficient exploration algorithms have been proposed, but they
require the definition of a behavior space that associates to each agent its
resulting behavior in a space known to be worth exploring. The need to define
this space is a limitation of these algorithms. In this work, we introduce
STAX, an algorithm designed to learn a behavior space on-the-fly and to
explore it while efficiently optimizing any reward discovered. It does so by
separating the exploration and learning of the behavior space from the
exploitation of the reward through an alternating two-step process. In the
first step, STAX builds a repertoire of diverse policies while learning a
low-dimensional representation of the high-dimensional observations generated
during policy evaluation. In the exploitation step, emitters are used to
optimize the performance of the discovered rewarding solutions. Experiments
conducted on three different sparse reward environments show that STAX
performs comparably to existing baselines while requiring much less prior
information about the task, as it autonomously builds the behavior space.
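
A minimal, self-contained sketch of the alternating two-step scheme described in the abstract may help make it concrete. The toy rollout, the fixed random projection standing in for the learned behavior-space encoder, and the simple Gaussian emitter below are all illustrative assumptions, not the authors' implementation:

```python
# Hedged sketch of an explore/exploit loop in the spirit of the abstract.
import numpy as np

rng = np.random.default_rng(0)
GOAL = np.array([1.0, 1.0])

def evaluate(params, horizon=20):
    """Roll out a tiny linear policy; reward is zero outside a small goal region."""
    w, b = params[:4].reshape(2, 2), params[4:]
    state, trajectory = np.zeros(2), []
    for _ in range(horizon):
        state = state + 0.1 * np.tanh(w @ state + b)
        trajectory.append(state.copy())
    observations = np.concatenate(trajectory)      # stand-in for high-dimensional sensor data
    dist = np.linalg.norm(state - GOAL)
    reward = 1.0 - dist / 0.3 if dist < 0.3 else 0.0   # sparse reward
    return observations, reward

# Stand-in for the learned low-dimensional behavior space (STAX learns this
# on-the-fly from the observations; a fixed random projection keeps the sketch short).
projection = rng.normal(size=(40, 2)) / np.sqrt(40)
def encode(observations):
    return observations @ projection

def is_novel(descriptor, repertoire, threshold=0.3):
    return all(np.linalg.norm(descriptor - d) > threshold for _, d in repertoire)

repertoire, rewarding = [], []   # (params, descriptor) pairs / [params, reward] entries
for iteration in range(30):
    # Step 1: exploration -- grow a repertoire of behaviorally diverse policies.
    for _ in range(20):
        if repertoire:
            parent, _ = repertoire[rng.integers(len(repertoire))]
            params = parent + 0.2 * rng.normal(size=6)
        else:
            params = rng.normal(size=6)
        obs, reward = evaluate(params)
        descriptor = encode(obs)
        if is_novel(descriptor, repertoire):
            repertoire.append((params, descriptor))
        if reward > 0:
            rewarding.append([params, reward])

    # Step 2: exploitation -- an emitter locally optimizes each rewarding solution.
    for entry in rewarding:
        best_params, best_reward = entry
        for _ in range(20):
            candidate = best_params + 0.05 * rng.normal(size=6)
            _, r = evaluate(candidate)
            if r > best_reward:
                best_params, best_reward = candidate, r
        entry[0], entry[1] = best_params, best_reward

print("repertoire size:", len(repertoire),
      "| best reward:", max((r for _, r in rewarding), default=0.0))
```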
Related papers
- Boosting Exploration in Actor-Critic Algorithms by Incentivizing
Plausible Novel States [9.210923191081864]
Actor-critic (AC) algorithms are a class of model-free deep reinforcement learning algorithms.
We propose a new method to boost exploration through an intrinsic reward, based on measurement of a state's novelty.
With incentivized exploration of plausible novel states, an AC algorithm is able to improve its sample efficiency and hence training performance.
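A hedged sketch of the general recipe, shaping the return with a novelty-based intrinsic bonus, is shown below; the k-nearest-neighbor novelty measure over a small state memory is a common stand-in and not the paper's specific novelty estimate.

```python
# Generic novelty-bonus reward shaping (illustrative stand-in).
import numpy as np

memory = []   # recently visited state features

def novelty(state, k=5):
    if len(memory) < k:
        return 1.0   # everything is novel while the memory is small
    dists = np.sort([np.linalg.norm(state - m) for m in memory])
    return float(np.mean(dists[:k]))   # mean distance to the k nearest neighbors

def shaped_reward(extrinsic, state, beta=0.1):
    state = np.asarray(state, dtype=float)
    bonus = beta * novelty(state)
    memory.append(state)
    return extrinsic + bonus

# Usage inside any actor-critic update loop, e.g.:
print(shaped_reward(0.0, [0.2, -0.1]))
```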
arXiv Detail & Related papers (2022-10-01T07:07:11Z)
- Searching a High-Performance Feature Extractor for Text Recognition
Network [92.12492627169108]
We design a domain-specific search space by exploring principles for building good feature extractors.
As the space is huge and complex in structure, no existing NAS algorithm can be applied.
We propose a two-stage algorithm to effectively search in the space.
arXiv Detail & Related papers (2022-09-27T03:49:04Z)
- Learning in Sparse Rewards settings through Quality-Diversity algorithms [1.4881159885040784]
This thesis focuses on the problem of sparse rewards with Quality-Diversity (QD) algorithms.
The first part of the thesis addresses learning a representation of the space in which the diversity of the policies is evaluated.
The thesis continues with the introduction of the SERENE algorithm, a method that can efficiently focus on the interesting parts of the search space.
arXiv Detail & Related papers (2022-03-02T11:02:34Z)
- Follow your Nose: Using General Value Functions for Directed Exploration
in Reinforcement Learning [5.40729975786985]
This paper explores the idea of combining exploration with auxiliary task learning using General Value Functions (GVFs) and a directed exploration strategy.
We provide a simple way to learn options (sequences of actions) instead of having to handcraft them, and demonstrate the performance advantage in three navigation tasks.
arXiv Detail & Related papers (2022-03-02T05:14:11Z)
- MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven
Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
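As a rough illustration of a conditional NML-style success probability (not the paper's estimator), one can refit a classifier with the query point labeled each way and normalize the resulting likelihoods; the logistic-regression choice below is an assumption.

```python
# Conditional-NML-flavored success probability (illustrative approximation).
import numpy as np
from sklearn.linear_model import LogisticRegression

def cnml_success_probability(X, y, query):
    """X, y: states labeled success (1) / failure (0); query: a new state."""
    scores = []
    for label in (0, 1):
        X_aug = np.vstack([X, query[None, :]])
        y_aug = np.append(y, label)
        clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
        scores.append(clf.predict_proba(query[None, :])[0, label])
    return scores[1] / (scores[0] + scores[1])   # normalize over both labels

# Example: use the uncertainty-aware success probability as a reward signal.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(cnml_success_probability(X, y, np.array([0.5, 0.5])))
```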
arXiv Detail & Related papers (2021-07-15T08:19:57Z)
- MADE: Exploration via Maximizing Deviation from Explored Regions [48.49228309729319]
In online reinforcement learning (RL), efficient exploration remains challenging in high-dimensional environments with sparse rewards.
We propose a new exploration approach via maximizing the deviation of the occupancy of the next policy from the explored regions.
Our approach significantly improves sample efficiency over state-of-the-art methods.
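A hedged, tabular sketch of the underlying idea, rewarding state-action pairs that deviate from the regions already occupied by past policies, is given below; the inverse-square-root count bonus is a simplified stand-in for the paper's occupancy-based objective.

```python
# Occupancy-deviation exploration bonus (simplified count-based stand-in).
from collections import defaultdict
import numpy as np

cumulative_visits = defaultdict(int)   # proxy for the occupancy of past policies

def exploration_bonus(state_action, scale=0.05):
    cumulative_visits[state_action] += 1
    # Rarely occupied (state, action) pairs deviate most from explored regions.
    return scale / np.sqrt(cumulative_visits[state_action])

# Usage: r_augmented = r_env + exploration_bonus((state_id, action_id))
print(exploration_bonus((3, 1)), exploration_bonus((3, 1)))
```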
arXiv Detail & Related papers (2021-06-18T17:57:00Z)
- Sparse Reward Exploration via Novelty Search and Emitters [55.41644538483948]
We introduce the SparsE Reward Exploration via Novelty and Emitters (SERENE) algorithm.
SERENE separates the search space exploration and reward exploitation into two alternating processes.
A meta-scheduler allocates a global computational budget by alternating between the two processes.
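A minimal sketch of such a meta-scheduler is shown below; the fixed round-robin budget split and the two process stubs are illustrative assumptions rather than SERENE's actual scheduling rule.

```python
# Toy meta-scheduler alternating a global budget between two processes.
def explore(budget):
    # placeholder: e.g. novelty search adding `budget` evaluations to an archive
    print(f"exploration: spending {budget} evaluations")

def exploit(budget):
    # placeholder: e.g. emitters refining discovered rewarding solutions
    print(f"exploitation: spending {budget} evaluations")

def meta_scheduler(total_budget, chunk=100, exploit_fraction=0.5):
    spent = 0
    while spent < total_budget:
        explore(int(chunk * (1 - exploit_fraction)))
        exploit(int(chunk * exploit_fraction))
        spent += chunk

meta_scheduler(total_budget=500)
```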
arXiv Detail & Related papers (2021-02-05T12:34:54Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value
Iteration [143.43658264904863]
We show how value iteration, under a more standard notion of low inherent Bellman error typically employed in least-squares value-iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
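A hedged sketch of the core supervised improvement step this suggests, sampling candidate actions, scoring them globally with the learned Q-function, and regressing the policy toward the best one, is given below; the network sizes and uniform action sampling are assumptions, not the paper's exact setup.

```python
# Supervised policy improvement toward the best Q-scored sampled action (sketch).
import torch
import torch.nn as nn

state_dim, action_dim, n_candidates = 4, 2, 32
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def supervised_policy_update(states):
    batch = states.shape[0]
    # Globally sample candidate actions and score them with the learned Q.
    candidates = 2 * torch.rand(batch, n_candidates, action_dim) - 1
    s_rep = states.unsqueeze(1).expand(-1, n_candidates, -1)
    q_vals = q_net(torch.cat([s_rep, candidates], dim=-1)).squeeze(-1)
    best = candidates[torch.arange(batch), q_vals.argmax(dim=1)]
    # Supervised regression of the policy toward the best sampled action.
    loss = ((policy(states) - best.detach()) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(supervised_policy_update(torch.randn(8, state_dim)))
```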
arXiv Detail & Related papers (2020-06-11T16:49:23Z)