Fugu-MT paper translation (abstract): Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs

Paper summary: Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs

arxiv url: http://arxiv.org/abs/2605.24939v1
Date: Sun, 24 May 2026 08:38:29 GMT
Status: translation complete
Last updated in system: 2026-05-26 19:50:18.526778
Title: Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs
Title (reference translation): Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs
Authors: Ziyue Chen, David Šiška, Lukasz Szpruch
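Abstract summary: We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces.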
Reference score (independently computed attention score): 3.2702446666873026
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function approximation, which extend the tabular softmax parameterization while retaining a tractable policy class. Under $Q^π_τ$-realizability for the regularized state-action value function, we first establish a non-uniform Polyak--Łojasiewicz (PŁ) inequality. The non-uniformity arises through degeneracy of constants associated with the policy geometry, namely the Fisher information matrix or an uncentered feature covariance matrix. We then identify two feature regimes under which this non-uniform constant can be bounded along the gradient flow. For full-affine-span features, we prove radial unboundedness of the KL regularizer and show that the smallest eigenvalue of the Fisher information matrix remains bounded below by an initialization-dependent positive constant. For simplex-valued features, we prove an analogous radial unboundedness result in the subspace orthogonal to the all-ones vector and obtain a uniform lower bound for the smallest eigenvalue of the uncentered covariance matrix. These results imply global linear convergence of the regularized objective along the gradient flow, i.e. suboptimality decaying as $\mathcal{O}(e^{-Ct})$ for some $C>0$. Our analysis extends the global convergence theory of entropy-regularized softmax policy gradient beyond the tabular setting of Agarwal et al. (2020); Bhandari and Russo (2024); Mei et al. (2020).
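Abstract (reference translation): We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function approximation. Under $Q^π_τ$-realizability for the regularized state-action value function, we first establish a non-uniform Polyak-Łojasiewicz inequality. The non-uniformity arises through degeneracy of constants associated with the policy geometry, namely the Fisher information matrix or an uncentered feature covariance matrix. We then identify two feature regimes under which this non-uniform constant can be bounded along the gradient flow. For full-affine-span features, we prove radial unboundedness of the KL regularizer and show that the smallest eigenvalue of the Fisher information matrix remains bounded below by an initialization-dependent positive constant. For simplex-valued features, we prove an analogous radial unboundedness result in the subspace orthogonal to the all-ones vector and obtain a uniform lower bound for the smallest eigenvalue of the uncentered covariance matrix. These results imply global linear convergence of the regularized objective along the gradient flow, i.e. suboptimality decaying as $\mathcal{O}(e^{-Ct})$ for some $C>0$. Our analysis extends the global convergence theory of entropy-regularized softmax policy gradient beyond the tabular setting of Agarwal et al. (2020), Bhandari and Russo (2024), and Mei et al. (2020).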
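As a rough illustration of the objects named in the abstract, the following Python sketch builds a log-linear softmax policy for a single state over a small finite action set and computes its Fisher information matrix and its uncentered feature covariance matrix. The finite action set, the feature values, the single-state restriction, and the function names (`softmax_policy`, `fisher_information`, `uncentered_covariance`) are assumptions made for this example and are not taken from the paper, which treats continuous state and action spaces.

```python
import numpy as np


def softmax_policy(theta, features):
    """Log-linear softmax policy for one state (illustrative assumption).

    features has shape (num_actions, d); row a plays the role of phi(s, a).
    Returns probabilities proportional to exp(theta . phi(s, a)).
    """
    logits = features @ theta
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def fisher_information(theta, features):
    """Fisher information at one state.

    For a log-linear softmax policy the score is phi(s, a) - E_pi[phi],
    so the Fisher matrix equals the covariance of the features under pi.
    """
    pi = softmax_policy(theta, features)
    centered = features - pi @ features
    return centered.T @ (pi[:, None] * centered)


def uncentered_covariance(theta, features):
    """Uncentered second-moment matrix E_pi[phi phi^T] at one state."""
    pi = softmax_policy(theta, features)
    return features.T @ (pi[:, None] * features)


# Hypothetical features whose affine span is all of R^2.
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

for theta in (np.zeros(2), np.array([10.0, 0.0])):
    fim = fisher_information(theta, features)
    cov = uncentered_covariance(theta, features)
    print(theta, np.linalg.eigvalsh(fim).min(), np.linalg.eigvalsh(cov).min())
```

In this toy setting the smallest eigenvalue of the Fisher matrix shrinks as the parameter grows and the policy concentrates on one action, which is the kind of degeneracy the abstract associates with the non-uniform PŁ constant.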

Related papers

Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
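We consider the problem of contextual online RLHF with general preferences. We adopt a generalized bilinear preference model that captures preferences through low-rank skew-symmetric matrices. We show that the dual gap of the greedy policy is bounded by the square of the estimation error.
Paper reference translation (metadata) (2026-02-26T15:27:53Z)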
Provably Efficient Algorithms for S- and Non-Rectangular Robust MDPs with General Parameterization [85.91302339486673]
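We study robust Markov decision processes (RMDPs) with general policy parameterization under s-rectangular and non-rectangular uncertainty sets. We prove novel Lipschitz and Lipschitz-smoothness properties for general policy parameterization that extend to infinite state spaces. We design a gradient descent algorithm for s-rectangular uncertainty and a Frank-Wolfe algorithm for non-rectangular uncertainty.
Paper reference translation (metadata) (2026-02-11T21:44:20Z)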
Achieve Performatively Optimal Policy for Performative Reinforcement Learning [55.983627302691424]
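This work proposes a zeroth-order Frank-Wolfe (0FW) algorithm. Experimental results suggest that 0FW is more effective than existing approximations at finding the desired performatively optimal (PO) policy.
Paper reference translation (metadata) (2025-10-06T01:56:31Z)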
Convergence of policy gradient methods for finite-horizon exploratory linear-quadratic control problems [3.8661825615213012]
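We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. We propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis and achieves linear convergence across different action frequencies.
Paper reference translation (metadata) (2022-11-01T17:31:41Z)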
Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies [115.86431674214282]
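We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and Q-NPG methods with the log-linear policy class. We show that both methods achieve linear convergence rates and $\mathcal{O}(1/\epsilon^2)$ sample complexity using a simple, non-adaptive, geometrically increasing step size.
Paper reference translation (metadata) (2022-10-04T06:17:52Z)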
On the Convergence Rates of Policy Gradient Methods [9.74841674275568]
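We consider geometrically discounted control problems over finite state spaces. It is shown that a parametrization of the orthogonal gradient in sample space makes it possible to analyze the general complexity of the gradient.
Paper reference translation (metadata) (2022-01-19T07:03:37Z)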
Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime [0.0]
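We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, entropy-regularized Markov decision processes (MDPs). The results rely on a careful analysis of a nonlinear Fokker-Planck-Kolmogorov equation.
Paper reference translation (metadata) (2022-01-18T20:17:16Z)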
Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
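Softmax policy gradient (PG) methods are one of the de facto implementations of policy optimization in modern reinforcement learning. We demonstrate that softmax PG methods can take exponential time to converge, in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$.
Paper reference translation (metadata) (2021-02-22T18:56:26Z)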
On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration [115.1954841020189]
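The inequality and non-asymptotic properties of approximation procedure with Polyak-Ruppert averaging. We prove a central limit theorem (CLT) for the averaged iterates with a constant step size and the number of iterations going to infinity.
Paper reference translation (metadata) (2020-04-09T17:54:18Z)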

The related papers list is generated automatically from the titles and abstracts of papers on this site.

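This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).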