Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization
- URL: http://arxiv.org/abs/2506.13345v1
- Date: Mon, 16 Jun 2025 10:36:24 GMT
- Title: Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization
- Authors: Sebastian Griesbach, Carlo D'Eramo
- Abstract summary: We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors.
- Score: 6.474106100512158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous heuristics and advanced approaches have been proposed for exploration in different settings for deep reinforcement learning. Noise-based exploration generally fares well with dense-shaped rewards and bonus-based exploration with sparse rewards. However, these methods usually require additional tuning to deal with undesirable reward settings by adjusting hyperparameters and noise distributions. Rewards that actively discourage exploration, i.e., with an action cost and no other dense signal to follow, can pose a major challenge. We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. To this end, we revisit the idea of maximizing the TD-error as a separate objective. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. SEE can be combined with off-policy algorithms without modifying the optimization pipeline of the original objective. In our experimental analysis, we show that a Soft Actor-Critic agent with the addition of SEE performs robustly across three diverse reward settings in a variety of tasks without hyperparameter adjustments.
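The core mechanism the abstract revisits, using the task critic's TD-error as the signal that a separate exploration objective maximizes, can be illustrated with a minimal sketch. This is not the authors' implementation: the three stabilizing design choices are intentionally omitted, and the network interfaces, shapes, and names below are illustrative assumptions.

```python
# Minimal sketch of the generic idea SEE revisits (not the authors' code):
# the absolute one-step TD-error of the task critic serves as the reward
# signal for a separate exploration objective.
import torch
import torch.nn as nn


def td_error_reward(q_net: nn.Module, q_target: nn.Module,
                    obs: torch.Tensor, action: torch.Tensor,
                    reward: torch.Tensor, next_obs: torch.Tensor,
                    next_action: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Absolute one-step TD-error of the task critic, treated as an
    exploration reward. Assumes a critic that takes a concatenated
    obs-action tensor and that `reward` matches the critic's output shape."""
    with torch.no_grad():
        q_sa = q_net(torch.cat([obs, action], dim=-1))
        q_next = q_target(torch.cat([next_obs, next_action], dim=-1))
        td_error = reward + gamma * q_next - q_sa
    return td_error.abs()
```

Under this reading, an exploration policy would be trained on td_error_reward(...) instead of the environment reward, while the base off-policy learner (e.g., SAC) keeps its original update rules, consistent with the abstract's claim that SEE leaves the optimization pipeline of the original objective untouched.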
Related papers
- MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization [91.80034860399677]
Reinforcement learning algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. We introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. (A generic sketch of the intrinsic-plus-extrinsic reward pattern appears after this list.)
arXiv Detail & Related papers (2024-12-16T18:59:53Z)
- Deterministic Exploration via Stationary Bellman Error Maximization [6.474106100512158]
Exploration is a crucial and distinctive aspect of reinforcement learning (RL).
In this paper, we introduce three modifications to stabilize Bellman error maximization and arrive at a deterministic exploration policy.
Our experimental results show that our approach can outperform $\varepsilon$-greedy in dense and sparse reward settings.
arXiv Detail & Related papers (2024-10-31T11:46:48Z)
- On-policy Actor-Critic Reinforcement Learning for Multi-UAV Exploration [0.7373617024876724]
Unmanned aerial vehicles (UAVs) have become increasingly popular in various fields, including precision agriculture, search and rescue, and remote sensing.
This study aims to address this challenge by utilizing on-policy Reinforcement Learning (RL) with Proximal Policy Optimization (PPO) to explore the two dimensional area of interest with multiple UAVs.
The proposed solution includes actor-critic networks using deep convolutional neural networks (CNN) and long short-term memory (LSTM) for identifying the UAVs and areas that have already been covered.
arXiv Detail & Related papers (2024-09-17T10:36:46Z)
- Random Latent Exploration for Deep Reinforcement Learning [71.88709402926415]
We introduce Random Latent Exploration (RLE), a simple yet effective exploration strategy in reinforcement learning (RL). On average, RLE outperforms noise-based methods, which perturb the agent's actions, and bonus-based exploration, which rewards the agent for attempting novel behaviors. RLE is as simple as noise-based methods, as it avoids complex bonus calculations but retains the deep exploration benefits of bonus-based methods.
arXiv Detail & Related papers (2024-07-18T17:55:22Z)
- DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noise during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- ADER: Adapting between Exploration and Robustness for Actor-Critic Methods [8.750251598581102]
We show that TD3's performance lags behind the vanilla actor-critic methods in some primitive environments.
We propose a novel algorithm toward this problem that ADapts between Exploration and Robustness, namely ADER.
Experiments in several challenging environments demonstrate the superiority of the proposed method in continuous control tasks.
arXiv Detail & Related papers (2021-09-08T05:48:39Z)
- MADE: Exploration via Maximizing Deviation from Explored Regions [48.49228309729319]
In online reinforcement learning (RL), efficient exploration remains challenging in high-dimensional environments with sparse rewards.
We propose a new exploration approach via maximizing the deviation of the occupancy of the next policy from the explored regions.
Our approach significantly improves sample efficiency over state-of-the-art methods.
arXiv Detail & Related papers (2021-06-18T17:57:00Z)
- A Flow Base Bi-path Network for Cross-scene Video Crowd Understanding in Aerial View [93.23947591795897]
In this paper, we strive to tackle the challenges and automatically understand the crowd from the visual data collected from drones.
To alleviate the background noise generated in cross-scene testing, a double-stream crowd counting model is proposed.
To tackle the crowd density estimation problem in extremely dark environments, we introduce synthetic data generated by the game Grand Theft Auto V (GTAV).
arXiv Detail & Related papers (2020-09-29T01:48:24Z)
- Dynamic Subgoal-based Exploration via Bayesian Optimization [7.297146495243708]
Reinforcement learning in sparse-reward navigation environments is challenging and poses a need for effective exploration.
We propose a cost-aware Bayesian optimization approach that efficiently searches over a class of dynamic subgoal-based exploration strategies.
An experimental evaluation demonstrates that the new approach outperforms existing baselines across a number of problem domains.
arXiv Detail & Related papers (2019-10-21T04:24:29Z)
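Several entries above, for example MaxInfoRL, as well as the bonus-based methods discussed in the main abstract, share one generic pattern: the agent maximizes the extrinsic task reward plus a weighted intrinsic bonus. The sketch below illustrates only that pattern under stated assumptions; it is not the exact objective of any paper listed here, and the ensemble-disagreement bonus (a common stand-in for information gain) and all names are illustrative.

```python
# Generic sketch of bonus-based exploration (not any listed paper's exact
# objective): extrinsic reward plus a weighted intrinsic bonus. Ensemble
# disagreement over next-state predictions stands in for information gain.
from typing import List

import torch
import torch.nn as nn


def intrinsic_bonus(ensemble: List[nn.Module], obs: torch.Tensor,
                    action: torch.Tensor) -> torch.Tensor:
    """Standard deviation of the ensemble's next-state predictions, averaged
    over state dimensions, as a per-sample exploration bonus."""
    with torch.no_grad():
        preds = torch.stack([m(torch.cat([obs, action], dim=-1)) for m in ensemble])
    return preds.std(dim=0).mean(dim=-1)


def mixed_reward(extrinsic: torch.Tensor, bonus: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
    """Combine task reward and exploration bonus; beta trades off exploration."""
    return extrinsic + beta * bonus
```

The weighting coefficient beta is exactly the kind of hyperparameter the main abstract says typically needs re-tuning across dense, sparse, and exploration-adverse reward settings, which is the failure mode SEE is designed to avoid.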
This list is automatically generated from the titles and abstracts of the papers on this site.