Fugu-MT Paper Translation (Abstract): Maximum Entropy Exploration Without the Rollouts

Paper abstract: Maximum Entropy Exploration Without the Rollouts

arxiv url: http://arxiv.org/abs/2603.12325v1
Date: Thu, 12 Mar 2026 18:00:03 GMT
Status: translation complete
Last updated in system: 2026-03-16 17:38:11.701297
Title: Maximum Entropy Exploration Without the Rollouts
Authors: Jacob Adamczyk, Adam Kamoski, Rahul V. Kulkarni
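Abstract summary: A principled formulation of the exploration problem is to find policies that maximize the entropy of their induced steady-state visitation distribution. This work considers an intrinsic average-reward formulation in which the reward is derived from the visitation distribution itself, so that the optimal policy maximizes steady-state entropy. This insight leads to EVE, an algorithm for solving the maximum entropy exploration problem that avoids explicit rollouts and distribution estimation.
Reference score (independently computed attention measure): 5.008597638379228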
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient exploration remains a central challenge in reinforcement learning, serving as a useful pretraining objective for data collection, particularly when an external reward function is unavailable. A principled formulation of the exploration problem is to find policies that maximize the entropy of their induced steady-state visitation distribution, thereby encouraging uniform long-run coverage of the state space. Many existing exploration approaches require estimating state visitation frequencies through repeated on-policy rollouts, which can be computationally expensive. In this work, we instead consider an intrinsic average-reward formulation in which the reward is derived from the visitation distribution itself, so that the optimal policy maximizes steady-state entropy. An entropy-regularized version of this objective admits a spectral characterization: the relevant stationary distributions can be computed from the dominant eigenvectors of a problem-dependent transition matrix. This insight leads to a novel algorithm for solving the maximum entropy exploration problem, EVE (EigenVector-based Exploration), which avoids explicit rollouts and distribution estimation, instead computing the solution through iterative updates, similar to a value-based approach. To address the original unregularized objective, we employ a posterior-policy iteration (PPI) approach, which monotonically improves the entropy and converges in value. We prove convergence of EVE under standard assumptions and demonstrate empirically that it efficiently produces policies with high steady-state entropy, achieving competitive exploration performance relative to rollout-based baselines in deterministic grid-world environments.
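To make the objective in the abstract concrete, the following is a minimal sketch of how a steady-state visitation distribution and its entropy can be obtained from a transition matrix alone, with no rollouts. It is not the EVE algorithm: EVE works with a problem-dependent matrix that the abstract does not specify. The sketch only uses standard power iteration on the chain induced by a fixed policy. The matrices `P_uniform` and `P_biased` and both function names are hypothetical, and the chains are assumed to be ergodic (irreducible and aperiodic) so that the stationary distribution is unique.

```python
import numpy as np


def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for the dominant left eigenvector of a row-stochastic P.

    Assumes the chain is ergodic, so the iteration converges to the unique
    stationary distribution d satisfying d = d @ P.
    """
    n = P.shape[0]
    d = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(iters):
        d_next = d @ P               # one step of the chain
        d_next /= d_next.sum()       # renormalize against round-off drift
        if np.abs(d_next - d).max() < tol:
            return d_next
        d = d_next
    return d


def entropy(d):
    """Shannon entropy (in nats) of a probability vector."""
    nz = d[d > 0]                    # skip zero entries: 0 * log 0 := 0
    return float(-(nz * np.log(nz)).sum())


# Hypothetical 3-state chains, each induced by some fixed policy.
# P_uniform is doubly stochastic, so its stationary distribution is uniform.
P_uniform = np.array([[0.5, 0.5, 0.0],
                      [0.0, 0.5, 0.5],
                      [0.5, 0.0, 0.5]])

# P_biased keeps returning to state 0, so long-run coverage is uneven.
P_biased = np.array([[0.9, 0.1, 0.0],
                     [0.5, 0.4, 0.1],
                     [0.5, 0.0, 0.5]])

d_uniform = stationary_distribution(P_uniform)
d_biased = stationary_distribution(P_biased)

print("uniform chain:", d_uniform, "entropy:", entropy(d_uniform))
print("biased chain: ", d_biased, "entropy:", entropy(d_biased))
```

Under these assumptions the doubly stochastic chain attains the maximum possible entropy for three states, log 3, while the biased chain scores lower. This ordering is what a maximum entropy exploration objective rewards.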
