Fugu-MT Paper Translation (Abstract): Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

Paper abstract: Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

arxiv url: http://arxiv.org/abs/2605.10909v1
Date: Mon, 11 May 2026 17:49:09 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:51.048943
Title: Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients
Authors: Alex DeWeese, Guannan Qu
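Abstract summary: This work revisits standard policy gradient methods used on restricted policy classes. It proposes a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window. The method is shown to be theoretically guaranteed to converge to a solution that is exponentially close to the optimal deterministic policy.
Reference score (independently computed attention score): 8.64427265159929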
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause of this phenomenon: the policy gradient is itself fundamentally myopic, i.e., it only improves the policy based on the one-step $Q$-function. We propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show that this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show that projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(\frac{1}{T})$ iterations, despite only assuming smoothness and differentiability of the value function. This will provide near-optimal solutions to previously elusive applications such as state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $\|d_\mu^{\pi^*} / d_\mu^{\pi}\|_\infty$ and $\|d_\mu^{\pi^*} / \mu\|_\infty$, enabling the $k$-step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.

Related papers