Fugu-MT 論文翻訳(概要): Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

論文の概要: Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

arxiv url: http://arxiv.org/abs/2002.10301v2
Date: Tue, 7 Jul 2020 21:58:23 GMT
ステータス: 翻訳完了
システム内更新日: 2022-12-29 03:28:29.063308
Title: Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning
Title（参考訳）: 一様境界変数を用いたQ-ラーニング:大規模ディスカウントは高速学習の障壁ではない
Authors: Adithya M. Devraj and Sean P. Meyn
Abstract要約: 我々は、すべての$gamma 1$に対して一様に束縛されたサンプル複雑性を持つ新しいアルゴリズムのクラスを導入する。最適化されたステップサイズシーケンスとQ-ラーニングアルゴリズムの共分散は、1/(1-ガンマ)$の二次関数であることを示す。
参考スコア（独自算出の注目度）: 7.6146285961466
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sample complexity bounds are a common performance metric in the Reinforcement Learning literature. In the discounted cost, infinite horizon setting, all of the known bounds have a factor that is a polynomial in $1/(1-\gamma)$, where $\gamma < 1$ is the discount factor. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an $\varepsilon$-optimal policy. The objective of the present work is to introduce a new class of algorithms that have sample complexity uniformly bounded for all $\gamma < 1$. One may argue that this is impossible, due to a recent min-max lower bound. The explanation is that this previous lower bound is for a specific problem, which we modify, without compromising the ultimate objective of obtaining an $\varepsilon$-optimal policy. Specifically, we show that the asymptotic covariance of the Q-learning algorithm with an optimized step-size sequence is a quadratic function of $1/(1-\gamma)$; an expected, and essentially known result. The new relative Q-learning algorithm proposed here is shown to have asymptotic covariance that is a quadratic in $1/(1- \rho^* \gamma)$, where $1 - \rho^* > 0$ is an upper bound on the spectral gap of an optimal transition matrix.
Abstract（参考訳）: サンプル複雑性境界は強化学習文献における一般的なパフォーマンス指標である。ディスカウントコスト、無限地平線設定において、既知のすべての境界は1/(1-\gamma)$の多項式である係数を持ち、ここで$\gamma < 1$がディスカウント係数である。大きな割引係数の場合、これらの境界は、非常に多くのサンプルが$\varepsilon$-optimal Policyを達成するために必要であることを示している。本研究の目的は、すべての$\gamma < 1$に対して一様有界なサンプル複雑性を持つ新しいアルゴリズムのクラスを導入することである。最近のmin-max下限のため、これは不可能であると主張する人もいる。この説明では、この前の下限は、$\varepsilon$-optimalポリシーを得る究極の目的を妥協することなく修正する特定の問題に対するものである。具体的には、Q-ラーニングアルゴリズムと最適化されたステップサイズシーケンスの漸近共分散が、1/(1-\gamma)$の二次関数であることを示す。ここで提案する新しい相対的q-ラーニングアルゴリズムは、1/(1- \rho^* \gamma)$ の二次である漸近共分散を持ち、1 - \rho^* > 0$ は最適遷移行列のスペクトルギャップの上界である。

関連論文リスト

SoS Certificates for Sparse Singular Values and Their Applications: Robust Statistics, Subspace Distortion, and More [37.208622097149714]
我々は、最大$|M u|$で境界を証明できる新しいアップタイムアルゴリズムの族を与える。我々の認証アルゴリズムは, Sum-of-Squares階層を必須に活用する。
論文参考訳（メタデータ） (2024-12-30T18:59:46Z)
Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
正方形損失に関して、標準的なガウス分布の下での$k$ReLU活性化の線形結合をPAC学習する問題をmathbbRd$で検討する。本研究の主な成果は,この学習課題に対して,サンプルおよび計算複雑性が$(dk/epsilon)O(k)$で,epsilon>0$が目標精度である。
論文参考訳（メタデータ） (2023-07-24T14:37:22Z)
Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise [50.64137465792738]
この問題に対する効率的なSQアルゴリズムは、少なくとも$Omega(d1/2/(maxp, epsilon)2)$. のサンプル複雑性を必要とする。我々の下限は、この1/epsilon$に対する二次的依存は、効率的なアルゴリズムに固有のものであることを示唆している。
論文参考訳（メタデータ） (2023-07-13T18:59:28Z)
A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization [53.044526424637866]
本稿では、2つの異なる対象の一般円錐最適化を最小化する近似二階定常点(SOSP)について検討する。特に、近似SOSPを見つけるためのNewton-CGベースの拡張共役法を提案する。
論文参考訳（メタデータ） (2023-01-10T20:43:29Z)
Small Covers for Near-Zero Sets of Polynomials and Learning Latent Variable Models [56.98280399449707]
我々は、s$ of cardinality $m = (k/epsilon)o_d(k1/d)$ に対して $epsilon$-cover が存在することを示す。構造的結果に基づいて,いくつかの基本的高次元確率モデル隠れ変数の学習アルゴリズムを改良した。
論文参考訳（メタデータ） (2020-12-14T18:14:08Z)
Empirical Risk Minimization in the Non-interactive Local Model of Differential Privacy [26.69391745812235]
本研究では,非対話型局所微分プライバシー(LDP)モデルにおける経験的リスク最小化(ERM)問題について検討する。これまでの研究では、誤差$alphaを達成するためには、一般的な損失関数の次元$p$に依存する必要があることが示されている。
論文参考訳（メタデータ） (2020-11-11T17:48:00Z)
Finite-Time Analysis for Double Q-learning [50.50058000948908]
二重Q-ラーニングのための非漸近的有限時間解析を初めて提供する。同期と非同期の二重Q-ラーニングの両方が,グローバル最適化の$epsilon$-accurate近辺に収束することが保証されていることを示す。
論文参考訳（メタデータ） (2020-09-29T18:48:21Z)
Streaming Complexity of SVMs [110.63976030971106]
本稿では,ストリーミングモデルにおけるバイアス正規化SVM問題を解く際の空間複雑性について検討する。両方の問題に対して、$frac1lambdaepsilon$の次元に対して、$frac1lambdaepsilon$よりも空間的に小さいストリーミングアルゴリズムを得ることができることを示す。
論文参考訳（メタデータ） (2020-07-07T17:10:00Z)
Gradient-Variation Bound for Online Convex Optimization with Constraints [25.002868073267464]
複数の機能的制約とユークリッド球のような比較的単純な制約セットからなる制約を持つオンライン凸最適化について検討する。一階法は、$mathcalO(sqrtT)$ regret と $mathcalO(1)$ 制約違反を達成するが、問題の構造的情報を考慮してはならない。本稿では,新しいオンライン原始双対ミラー-プロキシアルゴリズムによって得られる複雑な制約を伴って,オンライン凸最適化のインセンタンス依存バウンダリを提供する。
論文参考訳（メタデータ） (2020-06-22T17:38:14Z)
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
S$状態、$A$アクション、割引係数$gamma in (0,1)$、近似しきい値$epsilon > 0$の MDP が与えられた場合、$epsilon$-Optimal Policy を学ぶためのモデルなしアルゴリズムを提供する。十分小さな$epsilon$の場合、サンプルの複雑さで改良されたアルゴリズムを示す。
論文参考訳（メタデータ） (2020-06-06T13:34:41Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。