Fugu-MT 論文翻訳(概要): Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

論文の概要: Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

arxiv url: http://arxiv.org/abs/2112.10935v1
Date: Tue, 21 Dec 2021 01:54:17 GMT
ステータス: 翻訳完了
システム内更新日: 2021-12-22 14:26:22.576462
Title: Nearly Optimal Policy Optimization with Stable at Any Time Guarantee
Title（参考訳）: 安定な時間保証による最適政策最適化
Authors: Tianhao Wu, Yunchang Yang, Han Zhong, Liwei Wang, Simon S. Du, Jiantao Jiao
Abstract要約: citetshani 2020optimisticのポリシーベースのメソッドは、$tildeO(sqrtSAH3K + sqrtAH4)$, $S$ is the number of state, $A$ is the number of action, $H$ is the horizon, $K$ is the number of episodes, $sqrtSH$ gap with the informationtheoretic lower bound $tildeOmega(sqrtSAH3)
参考スコア（独自算出の注目度）: 53.155554415415445
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result of policy-based method in \citet{shani2020optimistic} is only $\tilde{O}(\sqrt{S^2AH^4K})$ where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes, and there is a $\sqrt{SH}$ gap compared with the information theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$. To bridge such a gap, we propose a novel algorithm Reference-based Policy Optimization with Stable at Any Time guarantee (\algnameacro), which features the property "Stable at Any Time". We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4})$ regret. When $S > H$, our algorithm is minimax optimal when ignoring logarithmic factors. To our best knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
Abstract（参考訳）: ポリシー最適化手法は強化学習(RL)アルゴリズムの最も広く使われているクラスの一つである。しかし、これらの方法の理論的理解は不十分である。表層的な(時間的不均一な)表層設定でさえ、政策に基づく方法の最先端理論的な結果が \citet{shani2020optimistic} においてのみ$\tilde{O}(\sqrt{S^2AH^4K})$である場合、$S$は状態の数、$A$は行動の数、$H$は地平線、$K$はエピソード数、$\sqrt{SH}$は情報理論上の下限である$\tilde{\Omega}(\sqrt{SAH^3K})$である。このようなギャップを埋めるため,我々は,"いつでも安定"特性を特徴とする,常に安定な参照型ポリシー最適化(\algnameacro)を提案する。我々のアルゴリズムが $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4})$ regret を達成することを証明している。 S > H$ の場合,アルゴリズムは対数因子を無視する場合に最適である。我々の知る限り、RPO-SATは表形式RLのための計算効率が良く、ほぼ最小限のポリシーベースのアルゴリズムである。

関連論文リスト

Actor-Critics Can Achieve Optimal Sample Efficiency [15.033410073144939]
我々は,$O(dH5 log|mathcalA|/epsilon2 + dH4 log|mathcalF|/epsilon2)$ trajectories のサンプル複雑度を得る新しいアクター批判アルゴリズムを提案する。我々はこれをHybrid RLの設定にまで拡張し、批評家をオフラインデータで初期化すると、純粋なオフラインやオンラインRLに比べてサンプル効率が向上することを示した。
論文参考訳（メタデータ） (2025-05-06T17:32:39Z)
Improved Robust Estimation for Erdős-Rényi Graphs: The Sparse Regime and Optimal Breakdown Point [3.793609515750114]
我々は、ErdHos-R'enyiランダムグラフのエッジ密度を強く推定する問題を、$G(n, dcirc/n)$で検討する。我々のアルゴリズムは2乗の総和階層に基づいている。
論文参考訳（メタデータ） (2025-03-05T21:45:17Z)
Linear Bandits on Ellipsoids: Minimax Optimal Algorithms [5.678465386088928]
作用の集合が楕円体であるような計算線形帯域を考える。この問題に対して、最初の既知のミニマックス最適アルゴリズムを提供する。実行には、時間$O(dT + d2 log(T/d) + d3)$とメモリ$O(d2)$のみが必要である。
論文参考訳（メタデータ） (2025-02-24T14:12:31Z)
Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
我々は,$widetildemathrmO(sqrtmathrmsp(h*) S A T)$のミニマックス最適後悔を伴う最初の抽出可能なアルゴリズムを提案する。注目すべきは、我々のアルゴリズムは$mathrmsp(h*)$に関する事前情報を必要としないことである。
論文参考訳（メタデータ） (2024-06-03T11:53:44Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
線形混合遷移カーネルを用いた最短経路(SSP)問題について検討する。エージェントは繰り返し環境と対話し、累積コストを最小化しながら特定の目標状態に到達する。既存の作業は、イテレーションコスト関数の厳密な下限や、最適ポリシーに対する期待長の上限を仮定することが多い。
論文参考訳（メタデータ） (2024-02-14T07:52:00Z)
Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
両世界のベスト・オブ・ワールドズ・アルゴリズムを$K$武器付き線形文脈包帯に対して検討する。我々のアルゴリズムは、敵対的体制と敵対的体制の両方において、ほぼ最適の後悔の限界を提供する。
論文参考訳（メタデータ） (2023-12-24T08:27:30Z)
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
我々はマルコフ決定過程(MDPs)のための証明可能なモデルフリー強化学習(RL)アルゴリズムを開発した。シミュレータ設定では,$widetildeO left(fracSAmathrmsp(h*)epsilon2+fracS2Amathrmsp(h*)epsilon2right)$サンプルを用いて,$epsilon$-optimal Policyを求める。
論文参考訳（メタデータ） (2023-06-28T17:43:19Z)
Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
本研究は,リスク許容度が$tau$のCVaR(Conditional Value at Risk)の目的に着目し,リスクに敏感な強化学習(RL)について検討する。ミニマックスCVaRの後悔率は$Omega(sqrttau-1AK)$で、$A$はアクションの数、$K$はエピソード数である。我々は,このアルゴリズムが連続性仮定の下で$widetilde O(tau-1sqrtSAK)$の最適後悔を達成し,一般に近似することを示す。
論文参考訳（メタデータ） (2023-02-07T02:22:31Z)
Best Policy Identification in Linear MDPs [70.57916977441262]
縮退した線形マルコフ+デルタ決定における最適同定問題について, 生成モデルに基づく固定信頼度設定における検討を行った。複雑な非最適化プログラムの解としての下位境界は、そのようなアルゴリズムを考案する出発点として用いられる。
論文参考訳（メタデータ） (2022-08-11T04:12:50Z)
Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP [31.62899359543925]
線形MDPを用いた最短経路問題(SSP)に対する2つの新しい非回帰アルゴリズムを提案する。我々の最初のアルゴリズムは計算効率が高く、後悔すべき$widetildeOleft(sqrtd3B_star2T_star Kright)$を達成している。第2のアルゴリズムは計算的に非効率であるが、$T_starに依存しない$widetildeO(d3.5B_starsqrtK)$の最初の「水平な」後悔を実現する。
論文参考訳（メタデータ） (2021-12-18T06:47:31Z)
Logarithmic Regret from Sublinear Hints [76.87432703516942]
自然クエリモデルにより,アルゴリズムが$O(log T)$ regretsを$O(sqrtT)$ hintsで得ることを示す。また、$o(sqrtT)$ hintsは$Omega(sqrtT)$ regretより保証できないことも示しています。
論文参考訳（メタデータ） (2021-11-09T16:50:18Z)
Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
我々は,バンディットフィードバックを用いてオンライン学習を学習する。 learnerは、コスト/リワード関数が"pseudo-1d"構造を許可するゼロ次オラクルのみにアクセスできる。我々は、$T$がラウンドの数である任意のアルゴリズムの後悔のために$min(sqrtdT、T3/4)$の下限を示しています。ランダム化オンライングラデーション下降とカーネル化指数重み法を組み合わせた新しいアルゴリズムsbcalgを提案し,疑似-1d構造を効果的に活用する。
論文参考訳（メタデータ） (2021-02-15T08:16:51Z)
Sparse Convex Optimization via Adaptively Regularized Hard Thresholding [17.60502131429094]
本稿では,適応正規化ハードThresholding (ARHT) アルゴリズムを提案する。また、OMP with Replacement (OMPR) for general $f$, under the condition $s > s* frackappa24$。
論文参考訳（メタデータ） (2020-06-25T17:16:21Z)
$Q$-learning with Logarithmic Regret [60.24952657636464]
楽観的な$Q$は$mathcalOleft(fracSAcdot mathrmpolyleft(Hright)Delta_minlogleft(SATright)right)$ cumulative regret bound, where $S$ is the number of state, $A$ is the number of action, $H$ is the planning horizon, $T$ is the total number of steps, $Delta_min$ is the least sub-Optitimality gap。
論文参考訳（メタデータ） (2020-06-16T13:01:33Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。