Fugu-MT Paper Translation (Abstract): Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring

Paper overview: Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring

arxiv url: http://arxiv.org/abs/2509.25438v1
Date: Mon, 29 Sep 2025 19:43:44 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.295616
Title: Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring
Title (reference translation): Noise-Robust Exploration and Learning Progress Monitoring
Authors: Zhibo Hou, Zhiyu An, Wan Du
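Abstract summary: This paper proposes an intrinsically motivated exploration method called Learning Progress Monitoring (LPM). During exploration, LPM rewards model improvement rather than prediction error or novelty, which effectively rewards the agent for observing learnable transitions. The results show that LPM's intrinsic reward converges faster, and that LPM explores more states in the maze experiment and achieves higher extrinsic reward in Atari.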
Reference score (independently computed attention score): 6.90856330255878
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When the environment contains an unlearnable source of randomness (a noisy-TV), an agent driven naively by intrinsic reward gets stuck at that source and fails to explore. Intrinsic rewards based on uncertainty estimation or distribution similarity eventually escape noisy-TVs as time unfolds, but they suffer from poor sample efficiency and high computational cost. Inspired by recent findings from neuroscience that humans monitor their improvements during exploration, we propose a novel method for intrinsically motivated exploration, named Learning Progress Monitoring (LPM). During exploration, LPM rewards model improvements instead of prediction error or novelty, effectively rewarding the agent for observing learnable transitions rather than unlearnable ones. We introduce a dual-network design that uses an error model to predict the expected prediction error of the dynamics model in its previous iteration, and uses the difference between the model errors of the current and previous iterations to guide exploration. We theoretically show that the intrinsic reward of LPM is zero-equivariant and a monotone indicator of Information Gain (IG), and that the error model is necessary to achieve monotonicity correspondence with IG. We empirically compare LPM against state-of-the-art baselines in noisy environments based on MNIST, a 3D maze with 160x120 RGB inputs, and Atari. Results show that LPM's intrinsic reward converges faster, and that LPM explores more states in the maze experiment and achieves higher extrinsic reward in Atari. This conceptually simple approach marks a paradigm shift in noise-robust exploration. For code to reproduce our experiments, see https://github.com/Akuna23Matata/LPM_exploration
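Abstract (reference translation): When the environment contains an unlearnable source of randomness (a noisy-TV), an exploring agent driven by intrinsic reward stays at that source and fails to explore. Intrinsic rewards based on uncertainty estimation or distribution similarity eventually escape noisy-TVs over time, but suffer from poor sample efficiency and high computational cost. Inspired by recent findings from neuroscience, we propose Learning Progress Monitoring (LPM), an intrinsically motivated exploration method. During exploration, LPM rewards model improvement rather than prediction error or novelty. We propose a dual-network design that uses an error model to predict the prediction error of the dynamics model in its previous iteration, and uses the difference between the model errors of the current and previous iterations to guide exploration. Theoretically, the intrinsic reward of LPM is zero-equivariant and a monotone indicator of Information Gain (IG), and the error model is necessary to achieve monotonicity correspondence with IG. We empirically compared LPM with state-of-the-art baselines in noisy environments based on MNIST, a 3D maze with 160x120 RGB inputs, and Atari. The results show that LPM's intrinsic reward converges faster, and that LPM explores more states in the maze experiment and achieves higher extrinsic reward in Atari. This conceptually simple approach represents a paradigm shift in noise-robust exploration. For code to reproduce the experiments, see https://github.com/Akuna23Matata/LPM_exploration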
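
The idea of rewarding the drop in model error, rather than the error itself, can be illustrated with a small toy sketch. This is not the authors' implementation (see their repository for that): the two-source environment, the scalar "dynamics model", the per-source mean used as a stand-in for the error model, and the names `lpm_style_reward` and `run_toy` are all assumptions made for illustration. The sign convention (previous error minus current error) is also an assumption based on the abstract's description.

```python
import numpy as np


def lpm_style_reward(expected_prev_error, current_error):
    """Reward the drop in model error between two consecutive iterations."""
    return expected_prev_error - current_error


def run_toy(iterations=30, batch=256, lr=0.5, seed=0):
    """Toy comparison of a progress-based reward and a prediction-error reward.

    Hypothetical setup: source 0 is learnable (target is always 1.0) and
    source 1 is a noisy TV (target is standard normal noise).
    """
    rng = np.random.default_rng(seed)
    dynamics = np.zeros(2)  # dynamics model: one scalar prediction per source
    progress_rewards, error_rewards = [], []
    for _ in range(iterations):
        targets = np.stack([np.ones(batch), rng.normal(size=batch)])
        # Stand-in for the error model: the expected squared error of the
        # dynamics model as it was at the end of the previous iteration.
        error_model = ((targets - dynamics[:, None]) ** 2).mean(axis=1)
        # One update step of the dynamics model toward the batch mean.
        dynamics = dynamics + lr * (targets.mean(axis=1) - dynamics)
        # Error of the updated dynamics model on the same batch.
        current_error = ((targets - dynamics[:, None]) ** 2).mean(axis=1)
        progress_rewards.append(lpm_style_reward(error_model, current_error))
        error_rewards.append(current_error)
    return np.array(progress_rewards), np.array(error_rewards)


if __name__ == "__main__":
    progress, pred_error = run_toy()
    print("progress reward, first iteration:", progress[0])
    print("progress reward, last iteration: ", progress[-1])
    print("prediction error, last iteration:", pred_error[-1])
```

In this toy, the progress-based reward for the learnable source starts high and decays toward zero as the model converges, while the reward for the noisy source stays close to zero throughout. The raw prediction error on the noisy source stays near the noise variance, which is why a reward based on prediction error alone would keep drawing the agent to it.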

Related papers