Related papers: Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring

URL: http://arxiv.org/abs/2509.25438v1
Date: Mon, 29 Sep 2025 19:43:44 GMT
Title: Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring
Authors: Zhibo Hou, Zhiyu An, Wan Du
Abstract summary: We propose a novel method for intrinsically-motivated exploration, named Learning Progress Monitoring (LPM). During exploration, LPM rewards model improvements instead of prediction error or novelty, effectively rewarding the agent for observing learnable transitions. Results show that LPM's intrinsic reward converges faster, and that LPM explores more states in the maze experiment and achieves higher extrinsic reward in Atari.
Score: 6.90856330255878
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When an environment contains an unlearnable source of randomness (a noisy-TV), an agent that explores by naively following intrinsic reward gets stuck at that source and fails to explore. Intrinsic rewards based on uncertainty estimation or distribution similarity eventually escape noisy-TVs as time unfolds, but they suffer from poor sample efficiency and high computational cost. Inspired by recent findings from neuroscience that humans monitor their improvements during exploration, we propose a novel method for intrinsically-motivated exploration, named Learning Progress Monitoring (LPM). During exploration, LPM rewards model improvements instead of prediction error or novelty, effectively rewarding the agent for observing learnable transitions rather than unlearnable ones. We introduce a dual-network design that uses an error model to predict the expected prediction error of the dynamics model in its previous iteration, and uses the difference between the model errors of the current and previous iterations to guide exploration. We theoretically show that the intrinsic reward of LPM is zero-equivariant and a monotone indicator of Information Gain (IG), and that the error model is necessary to achieve monotonicity correspondence with IG. We empirically compare LPM against state-of-the-art baselines in noisy environments based on MNIST, a 3D maze with 160x120 RGB inputs, and Atari. Results show that LPM's intrinsic reward converges faster, and that LPM explores more states in the maze experiment and achieves higher extrinsic reward in Atari. This conceptually simple approach marks a paradigm shift in noise-robust exploration. For code to reproduce our experiments, see https://github.com/Akuna23Matata/LPM_exploration
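The dual-network idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes two hypothetical "states" (one learnable, one noisy-TV), replaces both networks with per-state running averages, and assumes the reward is the error model's expectation minus the dynamics model's current error. The names `run_toy`, `dyn_pred`, and `err_pred` are made up for illustration.

```python
import numpy as np

def run_toy(visits=1000, lr_dyn=0.1, lr_err=0.5, seed=0):
    """Toy comparison of a learning-progress reward and a raw-error reward.

    State 0 is learnable (fixed target); state 1 is a noisy TV (random target).
    Both "models" are tabular running averages, an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    dyn_pred = np.zeros(2)   # stand-in dynamics model: predicted observation
    err_pred = np.zeros(2)   # stand-in error model: expected squared error
    seen = [False, False]
    progress = {0: [], 1: []}  # learning-progress style reward per state
    naive = {0: [], 1: []}     # raw prediction-error reward per state

    for _ in range(visits):
        for s in (0, 1):
            target = 1.0 if s == 0 else rng.normal()
            err_now = (dyn_pred[s] - target) ** 2
            if not seen[s]:
                # Warm start: no earlier iteration to compare against yet.
                err_pred[s] = err_now
                seen[s] = True
            # Reward = expected error of earlier iterations - current error.
            progress[s].append(err_pred[s] - err_now)
            naive[s].append(err_now)
            # Update the error model first, then the dynamics model.
            err_pred[s] += lr_err * (err_now - err_pred[s])
            dyn_pred[s] += lr_dyn * (target - dyn_pred[s])
    return progress, naive

if __name__ == "__main__":
    progress, naive = run_toy()
    print("learnable: total progress reward", sum(progress[0]))
    print("noisy TV:  mean progress reward ", np.mean(progress[1]))
    print("noisy TV:  mean raw-error reward", np.mean(naive[1]))
```

In this toy setting the raw prediction error on the noisy state stays high indefinitely, while the progress-style reward averages out near zero there and is positive only while the learnable state is still being learned. How closely this mirrors LPM's behaviour with neural networks is not something the sketch can establish.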

Related papers

A Model of Artificial Jagged Intelligence [0.0]
Generative AI systems often display highly uneven performance across tasks that appear "nearby". We call this phenomenon Artificial Jagged Intelligence (AJI). This paper develops a tractable economic model of AJI that treats adoption as an information problem.
arXiv Detail & Related papers (2026-01-12T14:27:30Z)
Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards [2.0987013818856877]
Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs. In practice, however, the verifier is almost never clean: unit tests probe only limited corner cases. We ask a pragmatic question: does the verification noise merely slow down the learning (rate), or can it flip the outcome (fate)?
arXiv Detail & Related papers (2026-01-07T21:31:26Z)
MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models [86.07486858219137]
Diffusion models excel at generating images conditioned on text prompts. The resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative. We show that this approach suffers from reward hacking, where the model produces images that score highly, yet deviate significantly from the original prompt.
arXiv Detail & Related papers (2025-10-02T00:47:36Z)
VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning [62.09195763860549]
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration. We introduce VOGUE (Visual Uncertainty Guided Exploration), a novel method that shifts exploration from the output (text) to the input (visual) space. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
arXiv Detail & Related papers (2025-10-01T20:32:08Z)
Anomalous Decision Discovery using Inverse Reinforcement Learning [3.3675535571071746]
Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems. Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios. We present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection.
arXiv Detail & Related papers (2025-07-06T17:01:02Z)
Test-Time Scaling of Diffusion Models via Noise Trajectory Search [10.8507840358202]
We introduce an $\epsilon$-greedy search algorithm that globally explores at extreme timesteps and locally exploits during the intermediate steps where de-mixing occurs. Experiments on EDM and Stable Diffusion reveal state-of-the-art scores for class-conditioned/text-to-image generation.
arXiv Detail & Related papers (2025-05-24T19:13:29Z)
The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards [31.806143589311652]
Vision-Language Models (VLMs) are increasingly used to generate reward signals for training embodied agents. Our research reveals that agents guided by VLM rewards often underperform compared to those employing only intrinsic rewards. We introduce BiMI, a novel reward function designed to mitigate noise.
arXiv Detail & Related papers (2024-09-24T09:45:20Z)
Random Latent Exploration for Deep Reinforcement Learning [71.88709402926415]
We introduce Random Latent Exploration (RLE), a simple yet effective exploration strategy in reinforcement learning (RL). On average, RLE outperforms noise-based methods, which perturb the agent's actions, and bonus-based exploration, which rewards the agent for attempting novel behaviors. RLE is as simple as noise-based methods, as it avoids complex bonus calculations but retains the deep exploration benefits of bonus-based methods.
arXiv Detail & Related papers (2024-07-18T17:55:22Z)
LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised Time Series Anomaly Detection [49.52429991848581]
We propose a Light and Anti-overfitting Retraining Approach (LARA) for deep variational auto-encoder (VAE) based time series anomaly detection methods. This work aims to make three novel contributions: 1) formulating the retraining process as a convex problem that converges at a fast rate and prevents overfitting; 2) designing a ruminate block, which leverages the historical data without the need to store them; and 3) mathematically proving that when fine-tuning the latent vector and reconstructed data, the linear formations can achieve the least adjusting errors between the ground truths and the fine-tuned ones.
arXiv Detail & Related papers (2023-10-09T12:36:16Z)
Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning [64.8463574294237]
We propose Rewarding Episodic Visitation Discrepancy (REVD) as an efficient and quantified exploration method. REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes. It is tested on PyBullet Robotics Environments and Atari games.
arXiv Detail & Related papers (2022-09-19T08:42:46Z)
Latent World Models For Intrinsically Motivated Exploration [140.21871701134626]
We present a self-supervised representation learning method for image-based observations. We consider episodic and life-long uncertainties to guide the exploration of partially observable environments.
arXiv Detail & Related papers (2020-10-05T19:47:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.