Fugu-MT Paper Translation (Abstract): Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs

Paper overview: Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs

arxiv url: http://arxiv.org/abs/2403.11477v1
Date: Mon, 18 Mar 2024 04:52:11 GMT
Status: Translation complete
Last updated in system: 2024-03-19 16:36:25.766205
Title: Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs
Title (reference translation): Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs
Authors: Matthew Zurek, Yudong Chen,
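Abstract summary: We study the sample complexity of learning an $\epsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\tilde{O}(SA\frac{H}{\epsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.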
Reference score (independently computed attention score): 6.996002801232415
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the sample complexity of learning an $\epsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\tilde{O}(SA\frac{H}{\epsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\epsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. We further investigate sample complexity in general (non-weakly-communicating) average-reward MDPs. We argue a new transient time parameter $B$ is necessary, establish an $\tilde{O}(SA\frac{B+H}{\epsilon^2})$ complexity bound, and prove a matching (up to log factors) minimax lower bound. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\tilde{O}\left(SA\frac{H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice to learn an $\epsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma\geq 1-1/H$, and $\tilde{O}\left(SA\frac{B+H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice in general MDPs when $\gamma\geq 1-\frac{1}{B+H}$. Both these results circumvent the well-known lower bound of $\tilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\epsilon^2}\right)$ for arbitrary $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span and transient time parameters. The weakly communicating bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.
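Abstract (reference translation): We study the sample complexity of learning an $\epsilon$-optimal policy under a generative model in an average-reward Markov decision process (MDP). For weakly communicating MDPs, we establish the complexity bound $\tilde{O}(SA\frac{H}{\epsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all of the parameters $S,A,H$ and $\epsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or depends suboptimally on the parameters. We further investigate sample complexity in general (non-weakly-communicating) average-reward MDPs. We argue that a new transient time parameter $B$ is necessary, establish an $\tilde{O}(SA\frac{B+H}{\epsilon^2})$ complexity bound, and prove a matching (up to log factors) minimax lower bound. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\tilde{O}\left(SA\frac{H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice to learn an $\epsilon$-optimal policy in weakly communicating MDPs in the regime $\gamma\geq 1-1/H$, and that $\tilde{O}\left(SA\frac{B+H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice in general MDPs when $\gamma\geq 1-\frac{1}{B+H}$. Both results circumvent the well-known lower bound of $\tilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\epsilon^2}\right)$ for arbitrary $\gamma$-discounted MDPs. Our analysis derives upper bounds on certain instance-dependent variance parameters in terms of the span and transient time parameters. The weakly communicating bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.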
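To make the comparison of bounds in the abstract concrete, the following is a minimal Python sketch that evaluates only the leading terms quoted there. It assumes logarithmic factors and absolute constants are dropped, and the function names and the numeric values are hypothetical choices for illustration; they do not come from the paper.

```python
import math

# Illustrative sketch only: leading terms of the bounds quoted in the abstract.
# Assumption: log factors and absolute constants are ignored, so the numbers
# show scaling behaviour, not actual sample counts.


def average_reward_bound(S, A, H, eps):
    """Leading term S*A*H / eps^2 (weakly communicating, average reward)."""
    return S * A * H / eps**2


def general_average_reward_bound(S, A, B, H, eps):
    """Leading term S*A*(B+H) / eps^2 (general average-reward MDPs)."""
    return S * A * (B + H) / eps**2


def discounted_span_bound(S, A, H, gamma, eps):
    """Leading term S*A*H / ((1-gamma)^2 eps^2), stated for gamma >= 1 - 1/H."""
    if gamma < 1 - 1 / H:
        raise ValueError("the abstract states this bound only for gamma >= 1 - 1/H")
    return S * A * H / ((1 - gamma) ** 2 * eps**2)


def discounted_worst_case_bound(S, A, gamma, eps):
    """Leading term S*A / ((1-gamma)^3 eps^2) of the classical lower bound."""
    return S * A / ((1 - gamma) ** 3 * eps**2)


if __name__ == "__main__":
    # Hypothetical example values.
    S, A, H, gamma, eps = 10, 5, 20.0, 0.99, 0.1
    span_term = discounted_span_bound(S, A, H, gamma, eps)
    worst_term = discounted_worst_case_bound(S, A, gamma, eps)
    # The ratio of the two leading terms simplifies to H * (1 - gamma).
    ratio = span_term / worst_term
    print("ratio of leading terms:", ratio)
    print("matches H * (1 - gamma):", math.isclose(ratio, H * (1 - gamma)))
```

The ratio of the two discounted leading terms is $H(1-\gamma)$, which is at most 1 exactly when $\gamma \geq 1 - 1/H$. This is the regime in which the abstract says the span-based bound circumvents the worst-case $(1-\gamma)^{-3}$ dependence.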

Related papers

Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
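We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) with smooth Bellman operators. The key to our solution is a novel projection technique based on ideas from harmonic analysis. Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
Paper reference translation (metadata) (2024-05-10T09:58:47Z)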
Span-Based Optimal Sample Complexity for Average Reward MDPs [6.996002801232415]
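We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. We obtain the bound $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
Paper reference translation (metadata) (2023-11-22T15:34:44Z)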
Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
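This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average-reward Markov decision process (AMDP). We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := sp(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
Paper reference translation (metadata) (2022-12-01T15:57:58Z)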
Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
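We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs). Our goal is to learn a near-optimal policy that nearly maximizes the cumulative reward over the time steps of an episode in such a model.
Paper reference translation (metadata) (2022-10-05T22:52:00Z)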
Best Policy Identification in Linear MDPs [70.57916977441262]
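We investigate the problem of best policy identification in discounted linear Markov decision processes in the fixed-confidence setting under a generative model. The lower bound, which arises as the solution of an intricate non-convex optimization program, can be used as the starting point for devising such algorithms.
Paper reference translation (metadata) (2022-08-11T04:12:50Z)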
Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model [3.749193647980305]
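This paper considers a Markov decision process (MDP) that admits a set of state-action features. We show that a model-based approach (resp. Q-learning) provably learns an $\varepsilon$-optimal policy with high probability.
Paper reference translation (metadata) (2021-05-28T17:49:39Z)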
SDP Achieves Exact Minimax Optimality in Phase Synchronization [19.909352968029584]
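We study the phase synchronization problem with noisy measurements $Y = z^* z^{*H} + \sigma W \in \mathbb{C}^{n \times n}$. We prove that the SDP achieves the error bound $(1+o(1))\frac{\sigma^2}{2np}$ under a squared $\ell_2$ loss.
Paper reference translation (metadata) (2021-01-07T03:14:05Z)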
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
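We study reinforcement learning with linear function approximation where the transition probability kernel of the Markov decision process (MDP) is a linear mixture model. We propose a new computationally efficient algorithm with linear function approximation, named $\text{UCRL-VTR}^{+}$, for these linear mixture MDPs. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
Paper reference translation (metadata) (2020-12-15T18:56:46Z)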
Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs [99.59319332864129]
UCBVI-$\gamma$ achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor, and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that, for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/\ldots\big)$.
Paper reference translation (metadata) (2020-10-01T17:57:47Z)
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
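Given an MDP with $S$ states, $A$ actions, discount factor $\gamma \in (0,1)$, and approximation threshold $\epsilon > 0$, we provide a model-free algorithm for learning an $\epsilon$-optimal policy. For sufficiently small $\epsilon$, we present an algorithm with improved sample complexity.
Paper reference translation (metadata) (2020-06-06T13:34:41Z)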

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

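This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).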