Fugu-MT Paper Translation (Abstract): Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Paper summary: Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

arxiv url: http://arxiv.org/abs/2006.03864v3
Date: Thu, 24 Dec 2020 18:46:27 GMT
Status: Translation complete
Last updated in system: 2022-11-24 21:07:20.350824
Title: Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
Title (reference translation): Model-Free Reinforcement Learning: From Clipped Pseudo-Regret to Sample Complexity
Authors: Zihan Zhang, Yuan Zhou, Xiangyang Ji
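Abstract summary: Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy. For small enough $\epsilon$, we show an algorithm with improved sample complexity.
Reference score (independently computed attention score): 59.34067736545355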
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}})$ (where the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S,A,1/(1-\gamma)$, and $1/\epsilon$) and success probability $(1-p)$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}})$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.
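Abstract (reference translation): In this paper, we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm that learns an $\epsilon$-optimal policy with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}})$ (where $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S,A,1/(1-\gamma)$, and $1/\epsilon$). For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}})$. While the first bound improves upon all known model-free algorithms and upon model-based ones with tight dependence on $S$, the second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.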
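
To make the gap between the two stated bounds concrete, the following is a minimal sketch that evaluates only the leading term $\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{k}}$ for $k = 5.5$ and $k = 3$. The function name `leading_term` and the problem sizes are hypothetical and chosen for illustration; the constants and poly-logarithmic factors hidden by $\tilde{O}(\cdot)$ are ignored, so the numbers are not actual sample counts.

```python
import math


def leading_term(S, A, gamma, epsilon, p, exponent):
    """Leading term S*A*ln(1/p) / (epsilon^2 * (1 - gamma)^exponent).

    Constants and poly-logarithmic factors hidden by the O-tilde
    notation are ignored, so the result is not an actual sample count.
    """
    if not (0.0 < gamma < 1.0 and epsilon > 0.0 and 0.0 < p < 1.0):
        raise ValueError("need gamma in (0,1), epsilon > 0, p in (0,1)")
    return S * A * math.log(1.0 / p) / (epsilon**2 * (1.0 - gamma) ** exponent)


# Hypothetical problem sizes, chosen only for illustration.
S, A, gamma, epsilon, p = 100, 10, 0.99, 0.1, 0.05

first_bound = leading_term(S, A, gamma, epsilon, p, exponent=5.5)
second_bound = leading_term(S, A, gamma, epsilon, p, exponent=3.0)

# The S, A, epsilon and p factors cancel; only (1 - gamma)^(-2.5) remains.
ratio = first_bound / second_bound

print(f"first bound  ~ {first_bound:.3e}")
print(f"second bound ~ {second_bound:.3e}")
print(f"ratio        ~ {ratio:.3e}")
```

The ratio of the two leading terms is $(1-\gamma)^{-2.5}$, which is $10^{5}$ for $\gamma = 0.99$. The abstract states the second bound only for small enough $\epsilon$; this sketch does not check whether the chosen $\epsilon$ falls in that regime.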

Related papers

A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise [36.29182619215646]
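We study the problem of PAC learning $\gamma$-margin halfspaces in the presence of Massart noise. Our algorithm is simple and practical, relying on online SGD on a carefully selected sequence of convex losses.
Paper reference translation (metadata) (2025-01-16T17:44:18Z)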
Sample and Computationally Efficient Robust Learning of Gaussian Single-Index Models [37.42736399673992]
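A single-index model (SIM) is a function of the form $\sigma(\mathbf{w}^{\ast} \cdot \mathbf{x})$, where $\sigma: \mathbb{R} \to \mathbb{R}$ is a known link function and $\mathbf{w}^{\ast}$ is a hidden unit vector. A proper learner attains an $L2$-error of $O(\mathrm{OPT})+\epsilon$.
Paper reference translation (metadata) (2024-11-08T17:10:38Z)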
Iterative thresholding for non-linear learning in the strong $\varepsilon$-contamination model [3.309767076331365]
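We derive approximation bounds for learning single-neuron models using thresholded gradient descent. We also study the linear regression problem, where $\sigma(\mathbf{x}) = \mathbf{x}$.
Paper reference translation (metadata) (2024-09-05T16:59:56Z)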
Finding good policies in average-reward Markov Decision Processes without prior knowledge [19.89784209009327]
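We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs). Using a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification.
Paper reference translation (metadata) (2024-05-27T12:24:14Z)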
Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise [50.64137465792738]
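Any efficient SQ algorithm for this problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$. Our lower bound suggests that this quadratic dependence on $1/\epsilon$ is inherent for efficient algorithms.
Paper reference translation (metadata) (2023-07-13T18:59:28Z)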
Replicability in Reinforcement Learning [46.89386344741442]
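We focus on the fundamental setting of discounted MDPs with access to a generative model. Following Impagliazzo et al., an RL algorithm is replicable if, with high probability, it outputs exactly the same policy after two executions.
Paper reference translation (metadata) (2023-05-31T05:16:23Z)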
Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
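We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs). Our goal is to learn a near-optimal policy that nearly maximizes the cumulative reward over the time steps in such a model.
Paper reference translation (metadata) (2022-10-05T22:52:00Z)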
Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
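We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector. We show that this dependence on $d$ is optimal up to logarithmic factors. We also provide the first total sensitivity upper bound of $O(d^{\max\{1,p/2\}}\log^2 n)$ for loss functions with at most degree-$p$ growth.
Paper reference translation (metadata) (2021-11-09T00:20:01Z)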
Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
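We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under Gaussian marginals. Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
Paper reference translation (metadata) (2021-08-19T16:16:48Z)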
Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model [3.749193647980305]
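This paper considers a Markov decision process (MDP) that admits a set of state-action features. We show that a model-based approach (resp. Q-learning) provably learns an $\varepsilon$-optimal policy with high probability.
Paper reference translation (metadata) (2021-05-28T17:49:39Z)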
Improved Sample Complexity for Incremental Autonomous Exploration in MDPs [132.88757893161699]
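We learn a set of $\epsilon$-optimal goal-conditioned policies that attain all states incrementally reachable within $L$ steps. DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for cost-sensitive shortest-path problems.
Paper reference translation (metadata) (2020-12-29T14:06:09Z)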

The related paper list is generated automatically from the titles and abstracts of the papers on this site.
