Fugu-MT 論文翻訳(概要): Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes

論文の概要: Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes

arxiv url: http://arxiv.org/abs/2102.00135v2
Date: Tue, 2 Feb 2021 01:59:41 GMT
ステータス: 翻訳完了
システム内更新日: 2021-02-03 12:48:10.072726
Title: Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes
Title（参考訳）: 強化学習のためのポリシーミラー降下:線形収束、新しいサンプリング複雑性、一般化問題クラス
Authors: Guanghui Lan
Abstract要約: 本稿では,強い凸あるいは一般凸正規化器を用いた強化学習問題の解法として,PMD法を提案する。私たちの知る限りでは、これらの開発は最適化と柔軟性の両面で新しくなっています。
参考スコア（独自算出の注目度）: 6.240369435223
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall seemly highly nonconvex problems we show that the PMD methods exhibit fast linear rate of convergence to the global optimality. We develop stochastic counterparts of these methods, and establish an ${\cal O}(1/\epsilon)$ (resp., ${\cal O}(1/\epsilon^2)$) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where $\epsilon$ denote the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by ${\cal O}\{(\log_\gamma \epsilon) [(1-\gamma)L/\mu]^{1/2}\log (1/\epsilon)\}$ (resp., ${\cal O} \{(\log_\gamma \epsilon ) [(1-\gamma)L/\epsilon]^{1/2}\}$) for problems with strongly (resp., general) convex regularizers. Here $\gamma$ denotes the discounting factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both optimization and RL literature. The introduction of these convex regularizers also greatly expands the flexibility and applicability of RL models.
Abstract（参考訳）: 本稿では,強化学習(RL)問題を,強い凸あるいは一般凸正規化器を用いて解くための新しいポリシーミラー降下法を提案する。これらの全体的非凸問題の構造的性質を調べることにより、pmd法は、大域的最適性への収束速度が速いことを示した。これらの方法の確率的対応法を開発し、 ${\cal O}(1/\epsilon)$ (resp., ${\cal O}(1/\epsilon^2)$) のサンプリング複雑性を確立し、これらのRL問題を異なるサンプリングスキームを用いて強く(resp., general)凸正規化することで解決する。さらに,これらの正規化子の勾配を計算するための複雑性は,必要であれば,強い(一般)凸正規化子を持つ問題に対して,${\cal o}\{(\log_\gamma \epsilon) [(1-\gamma)l/\mu]^{1/2}\log (1/\epsilon)\}$ (resp., ${\cal o} \{(\log_\gamma \epsilon ) [(1-\gamma)l/\epsilon]^{1/2}\}$) で限定できることを示した。ここで$\gamma$は割引要因を表します。我々の知る限り、これらの複雑さはアルゴリズムの発達とともに、最適化とRLの文献の両方において新しく見える。これらの凸正規化器の導入は、rlモデルの柔軟性と適用性を大きく広げる。

関連論文リスト

On Linear Convergence in Smooth Convex-Concave Bilinearly-Coupled Saddle-Point Optimization: Lower Bounds and Optimal Algorithms [17.227158587717934]
滑らかな凸凹型双線型結合型サドル点問題である $min_xmax_y f(x) + langle y,mathbfB xrangle - g(y)$ を再検討する。この問題クラスに対して、第一低次複雑性境界と最適線形収束アルゴリズムを設計する。
論文参考訳（メタデータ） (2024-11-21T22:06:25Z)
On the Convergence of Single-Timescale Actor-Critic [49.19842488693726]
本研究では,有限状態空間を持つ無限水平割引決定過程(MD)に対して,単時間アクタークリティカル(AC)アルゴリズムのグローバル収束を解析する。我々は,アクタと批評家の両方のステップサイズが (O(k-Pfrac12) として崩壊し,従来の (O(k-Pfrac12) ) レートから (非最適) の Markov フレームワーク最適化で一般的に使用される (O(k-Pfrac12) ) レートから$k$ になることを示した。
論文参考訳（メタデータ） (2024-10-11T14:46:29Z)
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
本稿では,円滑なベルマン作用素を持つ連続空間マルコフ決定過程(MDP)の一般クラスにおいて,$varepsilon$-optimal Policyを学習する問題を考察する。我々のソリューションの鍵となるのは、調和解析のアイデアに基づく新しい射影技術である。我々の結果は、連続空間 MDP における2つの人気と矛盾する視点のギャップを埋めるものである。
論文参考訳（メタデータ） (2024-05-10T09:58:47Z)
Polynomial-Time Solutions for ReLU Network Training: A Complexity Classification via Max-Cut and Zonotopes [70.52097560486683]
我々は、ReLUネットワークの近似の難しさがマックス・カッツ問題の複雑さを反映しているだけでなく、特定の場合において、それと完全に一致することを証明した。特に、$epsilonleqsqrt84/83-1approx 0.006$とすると、目的値に関して相対誤差$epsilon$でReLUネットワーク対象の近似グローバルデータセットを見つけることはNPハードであることが示される。
論文参考訳（メタデータ） (2023-11-18T04:41:07Z)
The Sample Complexity Of ERMs In Stochastic Convex Optimization [13.896417716930687]
実際に$tildeO(fracdepsilon+frac1epsilon2)$データポイントも十分であることを示す。さらに、この結果を一般化し、全ての凸体に対して同様の上界が成り立つことを示す。
論文参考訳（メタデータ） (2023-11-09T14:29:25Z)
Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space [17.366915676628867]
一般用途における強化学習の課題について考察する。我々のアルゴリズムは、$tildemathcalO(epsilon-3)$と$tildemathcalO(epsilon-2)$サンプル複雑度を達成する。
論文参考訳（メタデータ） (2023-06-02T18:16:35Z)
Optimal Sample Complexity of Reinforcement Learning for Mixing Discounted Markov Decision Processes [1.0445957451908694]
マルコフ決定過程(MDP)における無限地平面割引報酬の最大化のための最適なサンプル複雑性理論を考える。多くの関心の応用において、最適ポリシー(または全てのポリシー)は混合を誘導する。我々の分析は、一般状態空間 MDP の RL 問題の研究に使用できるため、独立した関心を持つ再生型アイデアに基礎を置いている。
論文参考訳（メタデータ） (2023-02-15T05:43:17Z)
A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization [53.044526424637866]
本稿では、2つの異なる対象の一般円錐最適化を最小化する近似二階定常点(SOSP)について検討する。特に、近似SOSPを見つけるためのNewton-CGベースの拡張共役法を提案する。
論文参考訳（メタデータ） (2023-01-10T20:43:29Z)
Optimal Gradient Sliding and its Application to Distributed Optimization Under Similarity [121.83085611327654]
積 $r:=p + q$, ここで$r$は$mu$-strong convex類似性である。エージェントの通信やローカルコールにマスターされた問題を解決する方法を提案する。提案手法は$mathcalO(sqrtL_q/mu)$法よりもはるかにシャープである。
論文参考訳（メタデータ） (2022-05-30T14:28:02Z)
Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization [37.177329562964765]
我々は凸リプシッツ損失を伴う線形予測、あるいはより一般に一般化線型形式の凸最適化問題を考える。この設定では、初期反復が明示的な正規化や投影を伴わずにグラディエント Descent (GD) を停止し、過大なエラーを最大$epsilon$で保証することを示した。しかし、標準球における一様収束は、$Theta (1/epsilon4)$サンプルを用いた最適下界学習を保証できることを示しているが、分布依存球における一様収束に依存している。
論文参考訳（メタデータ） (2022-02-27T09:41:43Z)
Bilinear Classes: A Structural Framework for Provable Generalization in RL [119.42509700822484]
Bilinear Classesは強化学習の一般化を可能にする新しい構造フレームワークである。このフレームワークは、サンプルの複雑さが達成可能な、ほとんどすべての既存のモデルを取り込んでいる。我々の主な成果は、双線形クラスのためのサンプル複雑性を持つRLアルゴリズムである。
論文参考訳（メタデータ） (2021-03-19T16:34:20Z)
Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
HSDMPGは、学習モデル上で過大なエラーの順序である$mathcalObig(1/sttnbig)$を達成可能であることを示す。損失係数について、HSDMPGは学習モデル上で過大なエラーの順序である$mathcalObig(1/sttnbig)$を達成できることを示す。
論文参考訳（メタデータ） (2020-09-18T02:18:44Z)
Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
学習者が生成モデル$Y = langle X,w* rangle + epsilon$から$n$のサンプルにアクセスできるような高次元頑健な線形回帰問題について検討する。 i) $X$ is L4-L2 hypercontractive, $mathbbE [XXtop]$ has bounded condition number and $epsilon$ has bounded variance, (ii) $X$ is sub-Gaussian with identity second moment and $epsilon$ is
論文参考訳（メタデータ） (2020-07-16T06:44:44Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。