Related papers: Exploitation Is All You Need... for Exploration

URL: http://arxiv.org/abs/2508.01287v1
Date: Sat, 02 Aug 2025 09:42:59 GMT
Title: Exploitation Is All You Need... for Exploration
Authors: Micah Rentschler, Jesse Roberts
Abstract summary: We show that an agent trained solely to maximize a greedy (exploitation-only) objective can nonetheless exhibit emergent exploratory behavior. Under the right prerequisites, exploration and exploitation need not be treated as orthogonal objectives but can emerge from a unified reward-maximization process.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Ensuring sufficient exploration is a central challenge when training meta-reinforcement learning (meta-RL) agents to solve novel environments. Conventional solutions to the exploration-exploitation dilemma inject explicit incentives such as randomization, uncertainty bonuses, or intrinsic rewards to encourage exploration. In this work, we hypothesize that an agent trained solely to maximize a greedy (exploitation-only) objective can nonetheless exhibit emergent exploratory behavior, provided three conditions are met: (1) Recurring Environmental Structure, where the environment features repeatable regularities that allow past experience to inform future choices; (2) Agent Memory, enabling the agent to retain and utilize historical interaction data; and (3) Long-Horizon Credit Assignment, where learning propagates returns over a time frame sufficient for the delayed benefits of exploration to inform current decisions. Through experiments in stochastic multi-armed bandits and temporally extended gridworlds, we observe that, when both structure and memory are present, a policy trained on a strictly greedy objective exhibits information-seeking exploratory behavior. We further demonstrate, through controlled ablations, that emergent exploration vanishes if either environmental structure or agent memory is absent (Conditions 1 & 2). Surprisingly, removing long-horizon credit assignment (Condition 3) does not always prevent emergent exploration, a result we attribute to the pseudo-Thompson Sampling effect. These findings suggest that, under the right prerequisites, exploration and exploitation need not be treated as orthogonal objectives but can emerge from a unified reward-maximization process.
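For readers new to the bandit setting, the contrast the abstract draws between greedy action selection and exploration can be illustrated with a minimal single-task sketch. It is not the authors' meta-RL agent: it compares a purely greedy agent, which always pulls the arm with the highest Beta-posterior mean, against standard Thompson Sampling, which pulls the arm with the highest posterior sample. The Bernoulli reward model, the Beta(1, 1) prior, and the function name `run_bandit` are assumptions made for illustration, and the paper's "pseudo-Thompson Sampling effect" is not reproduced here.

```python
import random


def run_bandit(true_probs, steps, policy, seed=0):
    """Simulate a Bernoulli multi-armed bandit and return the total reward.

    Illustrative sketch only (hypothetical helper, not the paper's agent).
    policy="greedy":   pull the arm with the highest Beta-posterior mean.
    policy="thompson": pull the arm with the highest Beta-posterior sample.
    Both agents start from a Beta(1, 1) prior on every arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    alpha = [1.0] * n_arms  # 1 + observed successes per arm
    beta = [1.0] * n_arms   # 1 + observed failures per arm
    total = 0
    for _ in range(steps):
        if policy == "greedy":
            # Exploitation only: the action choice itself has no randomness.
            scores = [a / (a + b) for a, b in zip(alpha, beta)]
        elif policy == "thompson":
            # Sample a plausible success rate for each arm from its posterior.
            scores = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        arm = max(range(n_arms), key=lambda i: scores[i])  # ties -> lowest index
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += reward
    return total


if __name__ == "__main__":
    for policy in ("greedy", "thompson"):
        print(policy, run_bandit([0.6, 0.9], steps=1000, policy=policy))
```

With `true_probs=[0.6, 0.9]`, the greedy agent may never pull arm 1: if arm 0's early rewards keep its posterior mean above 0.5 (the untouched prior mean of arm 1), greedy selection gives it no reason to try the other arm. Thompson Sampling randomizes through posterior uncertainty and keeps sampling arm 1 occasionally. In the paper's framing, a meta-trained greedy agent with memory (Condition 2) in a structured environment (Condition 1) is hypothesized to learn comparable information-seeking behavior without such an explicit randomization rule.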

Related papers

Intrinsically-Motivated Humans and Agents in Open-World Exploration [50.00331050937369]
We compare adults, children, and AI agents in a complex open-ended environment, Crafter. We find that only Entropy and Empowerment are consistently positively correlated with human exploration progress. We find preliminary evidence that private speech utterances, and particularly goal verbalizations, may aid exploration in children.
arXiv (2025-03-31T00:09:00Z)
Deterministic Exploration via Stationary Bellman Error Maximization [6.474106100512158]
Exploration is a crucial and distinctive aspect of reinforcement learning (RL). In this paper, we introduce three modifications to stabilize the latter and arrive at a deterministic exploration policy. Our experimental results show that our approach can outperform $\varepsilon$-greedy in dense and sparse reward settings.
arXiv (2024-10-31T11:46:48Z)
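The baseline named in the entry above, $\varepsilon$-greedy, is the simplest instance of the explicit randomization that the main paper's abstract contrasts with emergent exploration. A minimal sketch, assuming a list of action-value estimates `q_values` and a caller-supplied `random.Random` instance (both hypothetical names chosen for illustration); the stationary Bellman error method itself is not described here in enough detail to sketch.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index using epsilon-greedy selection (illustrative sketch).

    With probability epsilon, pick an action uniformly at random (explore);
    otherwise pick the action with the highest estimate (exploit).
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy branch; ties are broken toward the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [epsilon_greedy([0.1, 0.9, 0.3], 0.1, rng) for _ in range(1000)]
    print({a: picks.count(a) for a in range(3)})
```

Passing the generator in explicitly keeps runs reproducible, and with `epsilon=0.0` the function reduces to the purely greedy rule. The random branch can also land on the greedy action, so with three actions and `epsilon=0.1` the greedy action is chosen with probability $0.9 + 0.1/3 \approx 0.933$.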
Variable-Agnostic Causal Exploration for Reinforcement Learning [56.52768265734155]
We introduce a novel framework, Variable-Agnostic Causal Exploration for Reinforcement Learning (VACERL). Our approach automatically identifies crucial observation-action steps associated with key variables using attention mechanisms. It constructs the causal graph connecting these steps, which guides the agent towards observation-action pairs with greater causal influence on task completion.
arXiv (2024-07-17T09:45:27Z)
Successor-Predecessor Intrinsic Exploration [18.440869985362998]
We focus on exploration with intrinsic rewards, where the agent transiently augments the external rewards with self-generated intrinsic rewards. We propose Successor-Predecessor Intrinsic Exploration (SPIE), an exploration algorithm based on a novel intrinsic reward combining prospective and retrospective information. We show that SPIE yields more efficient and ethologically plausible exploratory behaviour in environments with sparse rewards and bottleneck states than competing methods.
arXiv (2023-05-24T16:02:51Z)
Discovering and Exploiting Sparse Rewards in a Learned Behavior Space [0.46736439782713946]
Learning optimal policies in sparse-reward settings is difficult because the learning agent has little to no feedback on the quality of its actions. We introduce STAX, an algorithm designed to learn a behavior space on the fly and to explore it while efficiently optimizing any reward discovered.
arXiv (2021-11-02T22:21:11Z)
Long-Term Exploration in Persistent MDPs [68.8204255655161]
We propose an exploration method called Rollback-Explore (RbExplore), which utilizes the concept of the persistent Markov decision process. We test our algorithm in the hard-exploration Prince of Persia game, without rewards or domain knowledge.
arXiv (2021-09-21T13:47:04Z)
Is Curiosity All You Need? On the Utility of Emergent Behaviours from Curious Exploration [20.38772636693469]
We argue that merely using curiosity for fast environment exploration or as a bonus reward for a specific task does not harness the full potential of this technique. We propose to shift the focus towards retaining the behaviours which emerge during curiosity-based learning.
arXiv (2021-09-17T15:28:25Z)
Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards. We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences. We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
arXiv (2021-07-12T17:58:40Z)
Locally Persistent Exploration in Continuous Control Tasks with Sparse Rewards [0.0]
We propose a new exploration method, based on two intuitions. The choice of the next exploratory action should depend not only on the (Markovian) state of the environment, but also on the agent's trajectory. We discuss the theoretical properties of locally self-avoiding walks, and their ability to provide a kind of short-term memory.
arXiv (2020-12-26T01:30:26Z)
Fast active learning for pure exploration in reinforcement learning [48.98199700043158]
We show that bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon. We also show that, with an improved analysis of the stopping time, we can improve the sample complexity in the best-policy identification setting by a factor of $H$.
arXiv (2020-07-27T11:28:32Z)
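The difference between the common $1/\sqrt{n}$ bonus scaling and the $1/n$ scaling mentioned in the entry above can be seen numerically. A minimal sketch, assuming a visit count `n` and a single constant `c`; the function names are hypothetical, and the bonuses analyzed in the paper may involve further problem-dependent terms that are omitted here.

```python
import math


def sqrt_bonus(n, c=1.0):
    """Conventional exploration bonus that decays like c / sqrt(n) (illustrative)."""
    # Unvisited pairs (n == 0) are treated as visited once to avoid division by zero.
    return c / math.sqrt(max(n, 1))


def inverse_bonus(n, c=1.0):
    """Faster-decaying bonus that scales like c / n (illustrative)."""
    return c / max(n, 1)


if __name__ == "__main__":
    for n in (1, 4, 16, 100):
        print(f"n={n:>3}  c/sqrt(n)={sqrt_bonus(n):.4f}  c/n={inverse_bonus(n):.4f}")
```

For every $n \ge 1$ the $1/n$ bonus is no larger than the $1/\sqrt{n}$ one, so the optimism it injects fades faster as visits accumulate.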
Planning to Explore via Self-Supervised World Models [120.31359262226758]
Plan2Explore is a self-supervised reinforcement learning agent. We present a new approach to self-supervised exploration and fast adaptation to new tasks. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods.
arXiv (2020-05-12T17:59:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.