Fugu-MT Paper Translation (Abstract): Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

Paper summary: Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

arxiv url: http://arxiv.org/abs/2603.22273v2
Date: Fri, 27 Mar 2026 17:44:46 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.125244
Title: Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration
Authors: Zakaria Mhammedi, James Cohan
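Abstract summary: The paper proposes a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. By removing the overhead of policy optimization, the approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. The authors demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms.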
Reference score (attention score computed independently by this site): 12.531650952835493
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.
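Illustrative sketch (not from the paper): the abstract describes a tree search inspired by the Go-With-The-Winner algorithm, guided by a measure of epistemic uncertainty, with no policy optimization during exploration. The abstract gives no implementation details, so the Python code below is only a toy instance of that general idea under loudly stated assumptions. The corridor environment (`step`), the visit-count proxy for uncertainty (`uncertainty`), and every parameter (`n_particles`, `horizon`, `rounds`) are hypothetical choices made for illustration and should not be read as the authors' method.

```python
import math
import random
from collections import Counter


def step(state, action, length):
    """Toy deterministic corridor (assumption): action is -1 or +1, state is clipped to [0, length]."""
    return max(0, min(length, state + action))


def uncertainty(visits, state):
    """Count-based stand-in for epistemic uncertainty (assumption): rarely visited states score higher."""
    return 1.0 / math.sqrt(visits[state])


def explore(length=30, n_particles=16, rounds=200, horizon=5, seed=0):
    """Particle search without policy learning: random rollouts, then clone the most uncertain particles."""
    rng = random.Random(seed)
    visits = Counter({0: 1})
    # Each particle is (state, action history from the start state).
    particles = [(0, ()) for _ in range(n_particles)]
    best = (0, ())  # farthest state reached so far and the actions that reach it

    for _ in range(rounds):
        moved = []
        for state, history in particles:
            # Roll the particle forward with uniformly random actions.
            for _ in range(horizon):
                action = rng.choice((-1, 1))
                state = step(state, action, length)
                history = history + (action,)
                visits[state] += 1
                if state > best[0]:
                    best = (state, history)
            moved.append((state, history))

        # Winners are the particles sitting in the most uncertain states;
        # the rest are discarded and replaced by copies of the winners.
        moved.sort(key=lambda p: uncertainty(visits, p[0]), reverse=True)
        winners = moved[: max(1, n_particles // 2)]
        particles = [winners[i % len(winners)] for i in range(n_particles)]

    return best, visits


if __name__ == "__main__":
    (best_state, best_actions), visits = explore()
    print("farthest state:", best_state, "trajectory length:", len(best_actions))
```

The returned action sequence can be replayed from the start state to reach `best_state` again. That is the kind of trajectory a separate distillation stage could consume, although the backward learning step mentioned in the abstract is not sketched here.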
