Fugu-MT 論文翻訳(概要): Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

論文の概要: Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

arxiv url: http://arxiv.org/abs/2010.00587v3
Date: Mon, 3 Jan 2022 02:23:25 GMT
ステータス: 翻訳完了
システム内更新日: 2022-10-12 07:54:36.248232
Title: Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs
Title（参考訳）: 分散MDPのための最小限の最適強化学習
Authors: Jiafan He and Dongruo Zhou and Quanquan Gu
Abstract要約: UCBVI-$gamma$が$tildeObig(sqrtSAT/ (1-gamma)1.5big)$ regret, where $S$ is the number of state, $A$ is the number of action, $gamma$ is the discount factor, $T$ is the number of steps。さらに、ハードMDPのクラスを構築し、任意のアルゴリズムに対して、期待される後悔は少なくとも$tildeOmegabig(sqrtSAT/)であることを示す。
参考スコア（独自算出の注目度）: 99.59319332864129
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
Abstract（参考訳）: 表設定下でのマルコフ決定過程(MDP)の強化学習問題について検討した。そこで本研究では,不確実性原理とベルンシュタイン型ボーナスを用いたモデルベースアルゴリズムucbvi-$\gamma$を提案する。 ucbvi-$\gamma$は$\tilde{o}\big({\sqrt{sat}}/{(1-\gamma)^{1.5}}\big)$ regret、ただし$s$は状態の数、$a$はアクションの数、$\gamma$はディスカウント係数、$t$はステップ数である。さらに、ハードMDPのクラスを構築し、任意のアルゴリズムに対して、期待される後悔は少なくとも$\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$であることを示す。上界は対数因子までのミニマックス下限と一致し, UCBVI-$\gamma$ はほぼ最小値であることを示す。

関連論文リスト

Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
非線型関数近似を用いたモデルに基づく強化学習について検討する。本研究では,無限水平平均逆法と割引逆法の両方に有効である確率効率のよい値反復型アルゴリズムを開発した。
論文参考訳（メタデータ） (2024-06-19T15:29:14Z)
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
我々はマルコフ決定過程(MDPs)のための証明可能なモデルフリー強化学習(RL)アルゴリズムを開発した。シミュレータ設定では,$widetildeO left(fracSAmathrmsp(h*)epsilon2+fracS2Amathrmsp(h*)epsilon2right)$サンプルを用いて,$epsilon$-optimal Policyを求める。
論文参考訳（メタデータ） (2023-06-28T17:43:19Z)
Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
本研究は,リスク許容度が$tau$のCVaR(Conditional Value at Risk)の目的に着目し,リスクに敏感な強化学習(RL)について検討する。ミニマックスCVaRの後悔率は$Omega(sqrttau-1AK)$で、$A$はアクションの数、$K$はエピソード数である。我々は,このアルゴリズムが連続性仮定の下で$widetilde O(tau-1sqrtSAK)$の最適後悔を達成し,一般に近似することを示す。
論文参考訳（メタデータ） (2023-02-07T02:22:31Z)
Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap [84.66885506098724]
本稿では,アダプティブ・マルチステップ・ブートストラップ (AMB) を用いた表層有限水平マルコフ決定過程 (MDP) のモデルフリーアルゴリズムを提案する。 AMBは,部分最適ギャップの逆の和でのみスケールする,ギャップ依存的後悔境界を達成できることを示す。また、AMB は $frac|Z_mul|Delta_min$ regret という追加の $frac|Z_mul|Delta_min$ を被っていることも示しています。
論文参考訳（メタデータ） (2021-02-09T07:46:34Z)
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
本稿では,マルコフ決定過程(MDP)の遷移確率核が線形混合モデルである線形関数近似による強化学習について検討する。上記の線形混合 MDP に対して$textUCRL-VTR+$ という線形関数近似を用いた計算効率の良い新しいアルゴリズムを提案する。我々の知る限り、これらは線形関数近似を持つRLのための計算効率が良く、ほぼ最小のアルゴリズムである。
論文参考訳（メタデータ） (2020-12-15T18:56:46Z)
Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping [99.59319332864129]
本稿では,割引決定(MDP)のための強化学習について検討する。本稿では,特徴写像を利用した新しいアルゴリズムを提案し,$tilde O(dsqrtT/ (1-gamma)2)$ regretを求める。以上の結果から,提案した強化学習アルゴリズムは,最大1-γ-0.5$の係数でほぼ最適であることが示唆された。
論文参考訳（メタデータ） (2020-06-23T17:08:54Z)
$Q$-learning with Logarithmic Regret [60.24952657636464]
楽観的な$Q$は$mathcalOleft(fracSAcdot mathrmpolyleft(Hright)Delta_minlogleft(SATright)right)$ cumulative regret bound, where $S$ is the number of state, $A$ is the number of action, $H$ is the planning horizon, $T$ is the total number of steps, $Delta_min$ is the least sub-Optitimality gap。
論文参考訳（メタデータ） (2020-06-16T13:01:33Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。