Exploration by Running Away from the Past
- URL: http://arxiv.org/abs/2411.14085v1
- Date: Thu, 21 Nov 2024 12:51:09 GMT
- Title: Exploration by Running Away from the Past
- Authors: Paul-Antoine Le Tolguenec, Yann Besse, Florent Teichteil-Koenigsbuch, Dennis G. Wilson, Emmanuel Rachelson
- Abstract summary: We consider exploration as a problem of maximizing the Shannon entropy of the state occupation measure.
This is done by maximizing a sequence of divergences between distributions representing an agent's past behavior and its current behavior.
We demonstrate that by encouraging the agent to explore by actively distancing itself from past experiences, it can effectively explore mazes and a wide range of behaviors on robotic manipulation and locomotion tasks.
- Score: 5.062282108230929
- Abstract: The ability to explore efficiently and effectively is a central challenge of reinforcement learning. In this work, we consider exploration through the lens of information theory. Specifically, we cast exploration as a problem of maximizing the Shannon entropy of the state occupation measure. This is done by maximizing a sequence of divergences between distributions representing an agent's past behavior and its current behavior. Intuitively, this encourages the agent to explore new behaviors that are distinct from past behaviors. Hence, we call our method RAMP, for ``$\textbf{R}$unning $\textbf{A}$way fro$\textbf{m}$ the $\textbf{P}$ast.'' A fundamental question of this method is the quantification of the distribution change over time. We consider both the Kullback-Leibler divergence and the Wasserstein distance to quantify divergence between successive state occupation measures, and explain why the former might lead to undesirable exploratory behaviors in some tasks. We demonstrate that by encouraging the agent to explore by actively distancing itself from past experiences, it can effectively explore mazes and a wide range of behaviors on robotic manipulation and locomotion tasks.
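The abstract does not detail the implementation, but a common way to realize the KL-based variant of such an objective is to train a discriminator that separates states visited by the current policy from states held in a buffer of past experience; the log density ratio it estimates then serves as an intrinsic reward. The sketch below follows that assumption only — the class name, the buffer split, and all hyperparameters are illustrative and are not taken from the RAMP paper.

```python
import torch
import torch.nn as nn

class StateDiscriminator(nn.Module):
    """Classifier separating states from the current policy (label 1)
    from states stored in the past-experience buffer (label 0)."""
    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)  # logit of P(state ~ current policy)


def discriminator_loss(disc, current_states, past_states):
    """Binary cross-entropy: push the classifier to tell current from past."""
    bce = nn.BCEWithLogitsLoss()
    cur = disc(current_states)
    old = disc(past_states)
    return bce(cur, torch.ones_like(cur)) + bce(old, torch.zeros_like(old))


def intrinsic_reward(disc, states):
    """Exploration bonus log d(s) - log(1 - d(s)), i.e. the raw logit.

    For a well-trained discriminator this approximates
    log p_current(s) / p_past(s), so maximizing it drives the agent
    toward states that are unlikely under its past occupation measure.
    """
    with torch.no_grad():
        return disc(states).squeeze(-1)
```

Under this reading, the bonus would be added to (or substituted for) the task reward of any off-the-shelf RL learner. The Wasserstein variant discussed in the abstract would instead use an (approximately) 1-Lipschitz critic rather than a classifier; that case is not covered by the sketch.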
Related papers
- MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization [91.80034860399677]
Reinforcement learning algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards.
We introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration.
We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits.
arXiv Detail & Related papers (2024-12-16T18:59:53Z) - A Temporally Correlated Latent Exploration for Reinforcement Learning [4.1101087490516575]
Temporally Correlated Latent Exploration (TeCLE) is a novel intrinsic reward formulation that employs an action-conditioned latent space and temporal correlation.
We find that the injected temporal correlation determines the exploratory behaviors of agents.
We prove that the proposed TeCLE can be robust to the Noisy TV and stochasticity in benchmark environments.
arXiv Detail & Related papers (2024-12-06T04:38:43Z) - Deterministic Exploration via Stationary Bellman Error Maximization [6.474106100512158]
Exploration is a crucial and distinctive aspect of reinforcement learning (RL).
In this paper, we introduce three modifications to stabilize Bellman error maximization and arrive at a deterministic exploration policy.
Our experimental results show that our approach can outperform $\varepsilon$-greedy in dense and sparse reward settings.
arXiv Detail & Related papers (2024-10-31T11:46:48Z) - Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations [62.48505112245388]
We take an in-depth look at the causal awareness of modern representations of agent interactions.
We show that recent representations are already partially resilient to perturbations of non-causal agents.
We propose a metric learning approach that regularizes latent representations with causal annotations.
arXiv Detail & Related papers (2023-12-07T18:57:03Z) - Maximum State Entropy Exploration using Predecessor and Successor Representations [17.732962106114478]
Animals have a developed ability to explore that aids them in important tasks such as locating food.
We propose $\eta\psi$-Learning, a method to learn efficient exploratory policies by conditioning on past episodic experience.
arXiv Detail & Related papers (2023-06-26T16:08:26Z) - Temporal Difference Uncertainties as a Signal for Exploration [76.6341354269013]
An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy.
In this paper, we highlight that value estimates are easily biased and temporally inconsistent.
We propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors.
arXiv Detail & Related papers (2020-10-05T18:11:22Z) - Primal Wasserstein Imitation Learning [44.87651595571687]
We propose a new Imitation Learning (IL) method based on a conceptually simple algorithm: Primal Wasserstein Imitation Learning (PWIL)
We show that PWIL recovers expert behavior on a variety of continuous control tasks of the MuJoCo domain in a sample-efficient manner, both in terms of agent interactions and of expert interactions with the environment.
arXiv Detail & Related papers (2020-06-08T15:30:11Z) - Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals [53.484562601127195]
We point out the inability to infer behavioral conclusions from probing results.
We offer an alternative method that focuses on how the information is being used, rather than on what information is encoded.
arXiv Detail & Related papers (2020-06-01T15:00:11Z) - Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z) - Never Give Up: Learning Directed Exploration Strategies [63.19616370038824]
We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies.
We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies.
A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbour lookup, biasing the novelty signal towards what the agent can control.
arXiv Detail & Related papers (2020-02-14T13:57:22Z)
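As an illustration of the episodic nearest-neighbour bonus described in the Never Give Up entry above, the following is a minimal sketch. The inverse-distance kernel loosely follows the commonly cited NGU formulation, but the normalization, the constants, and the assumption that embeddings come from an inverse-dynamics encoder are simplifications rather than the paper's exact recipe.

```python
import numpy as np

def episodic_bonus(embedding, memory, k=10, eps=1e-3, c=1e-3):
    """Episodic intrinsic reward in the spirit of Never Give Up: large when
    the current embedding is far from its k nearest neighbours in this
    episode's memory, small when the state has effectively been revisited.

    embedding: 1-D array from an (assumed) inverse-dynamics encoder.
    memory:    list of embeddings already stored during this episode.
    """
    if not memory:
        return 1.0  # the first state of an episode is maximally novel
    mem = np.stack(memory)                          # (M, D)
    d2 = np.sum((mem - embedding) ** 2, axis=1)     # squared distances
    d2_knn = np.sort(d2)[:k]                        # k nearest neighbours
    d2_knn = d2_knn / (d2_knn.mean() + 1e-8)        # crude running-scale normalization
    kernel = eps / (d2_knn + eps)                   # inverse-distance kernel
    return 1.0 / np.sqrt(kernel.sum() + c)

# Usage: compute the bonus for the new state, append its embedding to
# `memory`, and add the (scaled) bonus to the extrinsic reward.
```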
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.