Fugu-MT 論文翻訳(概要): Sample Complexity of Policy Gradient Finding Second-Order Stationary Points

論文の概要: Sample Complexity of Policy Gradient Finding Second-Order Stationary Points

arxiv url: http://arxiv.org/abs/2012.01491v1
Date: Wed, 2 Dec 2020 19:54:10 GMT
ステータス: 翻訳完了
システム内更新日: 2021-05-30 02:22:59.756834
Title: Sample Complexity of Policy Gradient Finding Second-Order Stationary Points
Title（参考訳）: 2次定常点を求める政策勾配のサンプル複雑度
Authors: Long Yang, Qian Zheng, Gang Pan
Abstract要約: 政策勾配が$(epsilon,sqrtepsilonchi)$-SOSPに収束することを示す。この手法は、潜在的に広範なポリシー勾配法に一般化することができる。
参考スコア（独自算出の注目度）: 21.631118289693653
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The goal of policy-based reinforcement learning (RL) is to search the maximal point of its objective. However, due to the inherent non-concavity of its objective, convergence to a first-order stationary point (FOSP) can not guarantee the policy gradient methods finding a maximal point. A FOSP can be a minimal or even a saddle point, which is undesirable for RL. Fortunately, if all the saddle points are \emph{strict}, all the second-order stationary points (SOSP) are exactly equivalent to local maxima. Instead of FOSP, we consider SOSP as the convergence criteria to character the sample complexity of policy gradient. Our result shows that policy gradient converges to an $(\epsilon,\sqrt{\epsilon\chi})$-SOSP with probability at least $1-\widetilde{\mathcal{O}}(\delta)$ after the total cost of $\mathcal{O}\left(\dfrac{\epsilon^{-\frac{9}{2}}}{(1-\gamma)\sqrt\chi}\log\dfrac{1}{\delta}\right)$, where $\gamma\in(0,1)$. Our result improves the state-of-the-art result significantly where it requires $\mathcal{O}\left(\dfrac{\epsilon^{-9}\chi^{\frac{3}{2}}}{\delta}\log\dfrac{1}{\epsilon\chi}\right)$. Our analysis is based on the key idea that decomposes the parameter space $\mathbb{R}^p$ into three non-intersected regions: non-stationary point, saddle point, and local optimal region, then making a local improvement of the objective of RL in each region. This technique can be potentially generalized to extensive policy gradient methods.
Abstract（参考訳）: 政策に基づく強化学習(RL)の目的は、その目的の最大点を探索することである。しかし、その目的の固有の非凸性のため、一階定常点 (FOSP) への収束は、極大点を求める政策勾配法を保証できない。 fosp は rl では望ましくない極小あるいは極小の saddle point であってもよい。幸いなことに、すべてのサドル点が \emph{strict} であれば、二階定常点 (SOSP) はすべて局所最大値と全く同じである。 fospの代わりに、政策勾配のサンプル複雑性を特徴付ける収束基準としてsospを考える。その結果、ポリシー勾配は$(\epsilon,\sqrt{\epsilon\chi})$-sosp に収束し、$\mathcal{o}\left(\dfrac{\epsilon^{-\frac{9}{2}}}{(1-\gamma)\sqrt\chi}\log\dfrac{1}{\delta}\right)$,ただし $\gamma\in(0,1)$ の合計コストの後に少なくとも 1-\widetilde{\mathcal{o}}(\delta)$ となる。我々の結果は、$\mathcal{O}\left(\dfrac{\epsilon^{-9}\chi^{\frac{3}{2}}}{\delta}\log\dfrac{1}{\epsilon\chi}\right)$が要求される最先端の結果を大幅に改善する。我々の分析は、パラメータ空間 $\mathbb{R}^p$ を非定常点、サドル点、局所最適領域の3つの非交差領域に分解し、各領域におけるRLの目的を局所的に改善するというキーアイデアに基づいている。この手法は広範な政策勾配法に応用できる可能性がある。

関連論文リスト

Near-Optimal Distributed Minimax Optimization under the Second-Order Similarity [22.615156512223763]
本研究では,有限サム構造を目的とする分散楽観的すべり(SVOGS)法を提案する。我々は$mathcal O(delta D2/varepsilon)$、$mathcal O(n+sqrtndelta D2/varepsilon)$、$tildemathcal O(n+sqrtndelta+L)D2/varepsilon)$の局所呼び出しを証明している。
論文参考訳（メタデータ） (2024-05-25T08:34:49Z)
Multiple-policy Evaluation via Density Estimation [30.914344538340412]
本稿では,この問題に対して$mathrmCAESAR$というアルゴリズムを提案する。低次かつ対数的な$mathrmCAESAR$は、$tildeOleft(fracH4epsilon2sum_h=1Hmax_kin[K]sum_s,afrac(d_hpik(s,a))2mu*_h(s,a)right)$である。
論文参考訳（メタデータ） (2024-03-29T23:55:25Z)
Provable Policy Gradient Methods for Average-Reward Markov Potential Games [15.136494705127564]
無限水平平均報酬基準の下でマルコフポテンシャルゲームを研究する。我々は、独立政策勾配と独立自然政策勾配に基づくアルゴリズムが、平均報酬基準に対するナッシュ均衡にグローバルに収束することを証明した。
論文参考訳（メタデータ） (2024-03-09T00:20:33Z)
Algorithms for mean-field variational inference via polyhedral optimization in the Wasserstein space [10.292118864147097]
ワッサーシュタイン空間上の有限次元多面体部分集合の理論を開発し、一階法による函数の最適化を行う。我々の主な応用は平均場変動推論の問題であり、これは分布の$pi$ over $mathbbRd$を製品測度$pistar$で近似しようとするものである。解析の副産物として,MFVIのための勾配に基づくアルゴリズムの最初のエンドツーエンド解析を求める。
論文参考訳（メタデータ） (2023-12-05T16:02:04Z)
Differentially Private Non-Convex Optimization under the KL Condition with Optimal Rates [33.46648653842542]
我々は,KL(Kurdyka-Lojasiewicz)条件を満たす損失に対する個人的経験的リスク問題について検討した。近似点法のプライベート実装で$tildeObig(big(fracsqrtdnsqrtrhobig)kappabig)を実現できることを示す。
論文参考訳（メタデータ） (2023-11-22T15:12:42Z)
Convergence of a Normal Map-based Prox-SGD Method under the KL Inequality [0.0]
我々は、$symbol$k$収束問題に対して、新しいマップベースのアルゴリズム(mathsfnorMtext-mathsfSGD$)を提案する。
論文参考訳（メタデータ） (2023-05-10T01:12:11Z)
ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
本稿では,BFG凸最適化(SCO: Reweighted Query (ReSQue) 推定ツールを提案する。我々はSCOの並列およびプライベート設定における最先端の複雑さを実現するアルゴリズムを開発した。
論文参考訳（メタデータ） (2023-01-01T18:51:29Z)
Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
この研究は、平均報酬マルコフ決定過程(AMDP)における$varepsilon$-Optimal Policyを得る際のサンプルの複雑さを考察する。我々は、状態-作用対当たりの$widetilde O(H varepsilon-3 ln frac1delta)$サンプルを証明し、$H := sp(h*)$は任意の最適ポリシーのバイアスのスパンであり、$varepsilon$は精度、$delta$は失敗確率である。
論文参考訳（メタデータ） (2022-12-01T15:57:58Z)
On the Complexity of Decentralized Smooth Nonconvex Finite-Sum Optimization [21.334985032433778]
分散最適化問題 $min_bf xinmathbb Rd f(bf x)triq frac1msum_i=1m f_i(bf x)triq frac1nsum_j=1n。
論文参考訳（メタデータ） (2022-10-25T11:37:11Z)
Best Policy Identification in Linear MDPs [70.57916977441262]
縮退した線形マルコフ+デルタ決定における最適同定問題について, 生成モデルに基づく固定信頼度設定における検討を行った。複雑な非最適化プログラムの解としての下位境界は、そのようなアルゴリズムを考案する出発点として用いられる。
論文参考訳（メタデータ） (2022-08-11T04:12:50Z)
Faster Rates of Convergence to Stationary Points in Differentially Private Optimization [31.46358820048179]
リプシッツの定常点と滑らかな関数を$(varepsilon,delta)$-differential privacy(DP)で近似する問題について検討する。点 $widehatw$ は関数 $mathbbRdrightarrowmathbbR$ if $|nabla F(widehatw)|leq alpha$ の $alpha$-stationary point と呼ばれる。我々は$tildeObig(big[)を見つける新しい効率的なアルゴリズムを提供する。
論文参考訳（メタデータ） (2022-06-02T02:43:44Z)
A first-order primal-dual method with adaptivity to local smoothness [64.62056765216386]
凸凹対象 $min_x max_y f(x) + langle Ax, yrangle - g*(y)$, ここで、$f$ は局所リプシッツ勾配を持つ凸関数であり、$g$ は凸かつ非滑らかである。主勾配ステップと2段ステップを交互に交互に行うCondat-Vuアルゴリズムの適応バージョンを提案する。
論文参考訳（メタデータ） (2021-10-28T14:19:30Z)
Faster Perturbed Stochastic Gradient Methods for Finding Local Minima [92.99933928528797]
局所最小値を求めるための高速な摂動勾配フレームワークであるtttPullbackを提案する。 SARAH/SP や STORM のような勾配推定器を用いたプルバックは $(epsilon, epsilon_H)$approximate local minima を $tilde O(epsilon-3 + H-6)$ 内で見つけることができる。我々のフレームワークの中核となる考え方は、勾配評価の平均運動を制御するステップサイズのプルバック方式である。
論文参考訳（メタデータ） (2021-10-25T07:20:05Z)
Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
タスクに依存しない強化学習のための効率的なアルゴリズムを提案する。このアルゴリズムは1/epsilon cdot (H3SA / rho + H4 S2 A) の$widetildemathcalOのみを探索する。情報理論上、この境界は$rho Theta (1/(HS))$と$H>1$に対してほぼ厳密であることを示す。
論文参考訳（メタデータ） (2021-08-11T20:42:46Z)
Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
S$状態、$A$アクション、計画的地平$H$で、エピソードな時間同質なMarkov決定プロセスに関するオフライン強化学習を再考する。経験的MDPを用いた評価と計画のための,約$H$自由なサンプル複雑性境界の最初の集合を得る。
論文参考訳（メタデータ） (2021-03-25T18:52:17Z)
Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
本稿では、特にバッチ強化学習に適した報酬不要強化学習フレームワークと、複数の報酬関数に対するポリシーを必要とするシナリオについて検討する。 textbfStaged textbfSampling + textbfTruncated textbfPlanning (algoname) という新しい効率的なアルゴリズムを提供しています。
論文参考訳（メタデータ） (2020-10-12T17:51:19Z)
Safe Learning under Uncertain Objectives and Constraints [66.05180398174286]
我々は、テキスト不明で安全クリティカルな制約の下で、非テクスト無知かつ安全クリティカルな最適化問題を考察する。このような問題は、ロボティクス、製造、医療などの様々な領域で自然に発生する。我々の分析の重要な要素は、安全な最適化の文脈で収縮と呼ばれる手法を導入し、適用することである。
論文参考訳（メタデータ） (2020-06-23T20:51:00Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。