Fugu-MT 論文翻訳(概要): Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

論文の概要: Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

arxiv url: http://arxiv.org/abs/2201.09457v3
Date: Thu, 27 Jan 2022 17:51:12 GMT
ステータス: 翻訳完了
システム内更新日: 2022-01-28 11:45:06.258975
Title: Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity
Title（参考訳）: 同所性政策ミラー降下:政策収束、暗黙的正則化、サンプル複雑性の改善
Authors: Yan Li, Tuo Zhao, Guanghui Lan
Abstract要約: 有限状態と作用空間を持つ割引・無限水平型MDPを解くホモトピーポリシーミラー降下法(HPMD)法。政策勾配法に関する文献では, 新たな3つの特性が報告されている。
参考スコア（独自算出の注目度）: 40.2022466644885
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence. We report three properties that seem to be new in the literature of policy gradient methods: (1) The policy first converges linearly, then superlinearly with order $\gamma^{-2}$ to the set of optimal policies, after $\mathcal{O}(\log(1/\Delta^*))$ number of iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function; (2) HPMD also exhibits last-iterate convergence, with the limiting policy corresponding exactly to the optimal policy with the maximal entropy for every state. No regularization is added to the optimization objective and hence the second observation arises solely as an algorithmic property of the homotopic policy gradient method. (3) For the stochastic HPMD method, we further demonstrate a better than $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| / \epsilon^2)$ sample complexity for small optimality gap $\epsilon$, when assuming a generative model for policy evaluation.
Abstract（参考訳）: 本稿では,有限状態と作用空間を持つ無限大地平線mdpを解くためのホモトピー・ポリシーミラー降下(hpmd)法を提案し,その政策収束について検討する。 We report three properties that seem to be new in the literature of policy gradient methods: (1) The policy first converges linearly, then superlinearly with order $\gamma^{-2}$ to the set of optimal policies, after $\mathcal{O}(\log(1/\Delta^*))$ number of iterations, where $\Delta^*$ is defined via a gap quantity associated with the optimal state-action value function; (2) HPMD also exhibits last-iterate convergence, with the limiting policy corresponding exactly to the optimal policy with the maximal entropy for every state. 最適化の目的に正規化は加えられず、従って第2の観測はホモトピーポリシー勾配法のアルゴリズム的性質としてのみ発生する。 (3) 確率HPMD法では、政策評価のための生成モデルを想定した場合、小さな最適性ギャップに対して、$\mathcal{O}(|\mathcal{S}| |\mathcal{A}| / \epsilon^2)$のサンプル複雑性よりも優れていることを示す。

関連論文リスト

Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
政策学習における強化学習について検討する。目的は、特定の種類の利害関係において最高の政策と競争力のある政策を見つけることである。
論文参考訳（メタデータ） (2025-07-06T14:40:05Z)
Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation [52.772454746132276]
問題依存量のモデル化における近似誤差は,アルゴリズムのグローバル収束とは無関係であることを示す。我々は,任意の定値学習率を持つ$textttLin-SPG$が,最適ポリシーへのグローバル収束を保証することを証明した。
論文参考訳（メタデータ） (2025-05-06T04:03:06Z)
Convergence of Policy Mirror Descent Beyond Compatible Function Approximation [66.4260157478436]
我々は,より弱い変動支配を前提とした理論的PMD一般政策クラスを開発し,最良クラス政策への収束を得る。我々の主観念は、占有度-勾配測度によって誘導される局所ノルムによって誘導される新しい概念を活用する。
論文参考訳（メタデータ） (2025-02-16T08:05:46Z)
Provable Policy Gradient Methods for Average-Reward Markov Potential Games [15.136494705127564]
無限水平平均報酬基準の下でマルコフポテンシャルゲームを研究する。我々は、独立政策勾配と独立自然政策勾配に基づくアルゴリズムが、平均報酬基準に対するナッシュ均衡にグローバルに収束することを証明した。
論文参考訳（メタデータ） (2024-03-09T00:20:33Z)
Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
無限水平割引マルコフ決定過程(拘束型MDP)の最適ポリシの計算問題について検討する。我々は, 最適制約付きポリシーに反復的に対応し, 非漸近収束性を持つ2つの単一スケールポリシーに基づく原始双対アルゴリズムを開発した。我々の知る限り、この研究は制約付きMDPにおける単一時間スケールアルゴリズムの非漸近的な最後の収束結果となる。
論文参考訳（メタデータ） (2023-06-20T17:27:31Z)
Policy Mirror Descent Inherently Explores Action Space [10.772560347950053]
1/epsilon2)$tildemathcalO($tildemathcalO)を,探索戦略を使わずに,オンラインポリシグラデーションメソッドのサンプル複雑性を初めて確立しました。新しいポリシー勾配法は、最適なポリシーを探す際に、潜在的にリスクの高い行動に繰り返しコミットするのを防ぐことができる。
論文参考訳（メタデータ） (2023-03-08T05:19:08Z)
Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies [115.86431674214282]
我々は、無限水平割引マルコフ決定過程を考察し、自然政策勾配(NPG)とQ-NPG法の収束率を対数線形ポリシークラスで検討する。両手法が線形収束率と $mathcalO (1/epsilon2)$サンプル複雑度を, 単純で非適応的な幾何的に増加するステップサイズを用いて達成できることを示す。
論文参考訳（メタデータ） (2022-10-04T06:17:52Z)
Best Policy Identification in Linear MDPs [70.57916977441262]
縮退した線形マルコフ+デルタ決定における最適同定問題について, 生成モデルに基づく固定信頼度設定における検討を行った。複雑な非最適化プログラムの解としての下位境界は、そのようなアルゴリズムを考案する出発点として用いられる。
論文参考訳（メタデータ） (2022-08-11T04:12:50Z)
Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
最適化手法の優位性は、正確な勾配が用いられるかどうかに大きく依存することを示す。次に,政策最適化におけるコミット率の概念を紹介する。第三に、外部のオラクル情報がない場合には、収束を加速するために幾何を利用することと、最適性をほぼ確実に達成することとの間に本質的にトレードオフがあることが示される。
論文参考訳（メタデータ） (2021-10-29T06:35:44Z)
Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
Softmax Policy gradient(PG)メソッドは、現代の強化学習におけるポリシー最適化の事実上の実装の1つです。ソフトマックス PG 法は、$mathcalS|$ および $frac11-gamma$ の観点から指数時間で収束できることを実証する。
論文参考訳（メタデータ） (2021-02-22T18:56:26Z)
On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method [38.34416337932712]
政策は、例えばREINFORCEのようなリッチな強化学習(RL)手法を生み出します。しかし、そのようなメソッドが$epsilon$-optimal Policyを見つけるための最もよく知られたサンプルの複雑さは$mathcalO(epsilon-3)$である。第一次政策最適化法の基本収束特性とサンプル効率について検討する。
論文参考訳（メタデータ） (2021-02-17T07:06:19Z)
On the Global Convergence Rates of Softmax Policy Gradient Methods [45.1868906130788]
真の勾配では、ソフトマックスパラメトリゼーションによるポリシー勾配が$O(e-c cdot t)$レートで収束することを示す。第2に,エントロピー規則化政策勾配を解析し,より高速な線形収束率を示す。第3に、エントロピー正則化が真の勾配であっても、政策最適化をどのように改善するかを説明する。
論文参考訳（メタデータ） (2020-05-13T16:01:39Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。