Fugu-MT 論文翻訳(概要): Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

論文の概要: Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

arxiv url: http://arxiv.org/abs/2006.03041v3
Date: Sun, 8 Aug 2021 03:42:12 GMT
ステータス: 翻訳完了
システム内更新日: 2022-11-25 10:02:16.744861
Title: Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction
Title（参考訳）: 非同期Q-Learningのサンプル複雑性:シャーパ解析とばらつき低減
Authors: Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, Yuxin Chen
Abstract要約: 非同期Q-ラーニングはマルコフ決定過程(MDP)の最適行動値関数(またはQ-関数)を学習することを目的としている。 Q-関数の入出力$varepsilon$-正確な推定に必要なサンプルの数は、少なくとも$frac1mu_min (1-gamma)5varepsilon2+ fract_mixmu_min (1-gamma)$の順である。
参考スコア（独自算出の注目度）: 63.41789556777387
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a $\gamma$-discounted MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, we demonstrate that the $\ell_{\infty}$-based sample complexity of classical asynchronous Q-learning --- namely, the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function --- is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2}+ \frac{t_{mix}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, $t_{mix}$ and $\mu_{\min}$ denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the sample complexity in the synchronous case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the cost taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result \cite{qu2020finite} by a factor of at least $|\mathcal{S}||\mathcal{A}|$ for all scenarios, and by a factor of at least $t_{mix}|\mathcal{S}||\mathcal{A}|$ for any sufficiently small accuracy level $\varepsilon$. Further, we demonstrate that the scaling on the effective horizon $\frac{1}{1-\gamma}$ can be improved by means of variance reduction.
Abstract（参考訳）: 非同期Q-ラーニングは、行動ポリシーによって誘導されるマルコフサンプルの1つの軌道に基づいて、マルコフ決定過程(MDP)の最適アクション値関数(またはQ-関数)を学習することを目的としている。 Focusing on a $\gamma$-discounted MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, we demonstrate that the $\ell_{\infty}$-based sample complexity of classical asynchronous Q-learning --- namely, the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function --- is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2}+ \frac{t_{mix}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor, provided that a proper constant learning rate is adopted. ここで、$t_{mix}$ と $\mu_{\min}$ はそれぞれ標本軌道の混合時間と最小状態-作用占有確率を表す。この境界の第一項は、軌道の定常分布から独立したサンプルが引き出される同期の場合のサンプル複雑性と一致する。第2の用語は、マルコフ軌道が定常状態に達するための経験的分布に要するコストを反映している。上記の境界は、すべてのシナリオに対して少なくとも$|\mathcal{S}||\mathcal{A}|$の係数と、少なくとも$t_{mix}|\mathcal{S}|||\mathcal{A}|$の係数により、任意の十分小さな精度レベルで$\varepsilon$の係数によって、最先端の結果 \cite{qu2020finite} により改善される。さらに,実効的なhorizon $\frac{1}{1-\gamma}$のスケーリングが分散還元によって改善できることを実証する。

関連論文リスト

Complexity Analysis of Normalizing Constant Estimation: from Jarzynski Equality to Annealed Importance Sampling and beyond [28.931489333515618]
非正規化確率密度 $piproptomathrme-V$ が与えられたとき、正規化定数 $Z=int_mathbbRdmathrme-V(x)mathrmdx$ または自由エネルギー $F=-log Z$ はベイズ統計学、統計力学、機械学習において重要な問題である。本稿では,逆拡散型サンプリング器に基づく新しい正規化定数推定アルゴリズムを提案し,その複雑さを解析するための枠組みを確立する。
論文参考訳（メタデータ） (2025-02-07T00:05:28Z)
Polynomial time sampling from log-smooth distributions in fixed dimension under semi-log-concavity of the forward diffusion with application to strongly dissipative distributions [9.48556659249574]
固定次元の複雑なサンプリングアルゴリズムを提案する。我々は,提案アルゴリズムが予測される$epsilon$誤差を$KL$ばらつきで達成することを証明する。応用として、$L$-log-smooth分布からサンプリングする問題に対する指数関数的複雑性の改善を導出する。
論文参考訳（メタデータ） (2024-12-31T17:51:39Z)
Provable Benefit of Annealed Langevin Monte Carlo for Non-log-concave Sampling [28.931489333515618]
簡単なアンニール型Langevin Monte Carloアルゴリズムに対して$widetildeOleft(fracdbeta2cal A2varepsilon6right)のオラクル複雑性を確立する。例えば、$cal A$ は対象分布 $pi$ と容易にサンプリング可能な分布を補間する確率測度の曲線の作用を表す。
論文参考訳（メタデータ） (2024-07-24T02:15:48Z)
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
本稿では,円滑なベルマン作用素を持つ連続空間マルコフ決定過程(MDP)の一般クラスにおいて,$varepsilon$-optimal Policyを学習する問題を考察する。我々のソリューションの鍵となるのは、調和解析のアイデアに基づく新しい射影技術である。我々の結果は、連続空間 MDP における2つの人気と矛盾する視点のギャップを埋めるものである。
論文参考訳（メタデータ） (2024-05-10T09:58:47Z)
Multiple-policy Evaluation via Density Estimation [30.914344538340412]
本稿では,この問題に対して$mathrmCAESAR$というアルゴリズムを提案する。低次かつ対数的な$mathrmCAESAR$は、$tildeOleft(fracH4epsilon2sum_h=1Hmax_kin[K]sum_s,afrac(d_hpik(s,a))2mu*_h(s,a)right)$である。
論文参考訳（メタデータ） (2024-03-29T23:55:25Z)
Optimal Sample Complexity for Average Reward Markov Decision Processes [1.0445957451908694]
平均報酬 MDP の最適ポリシを$widetilde O(|S||A|t_textmix2 epsilon-2)$で推定する。これは文学の下位境界に到達した最初のアルゴリズムと分析である。
論文参考訳（メタデータ） (2023-10-13T03:08:59Z)
Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
オフライン強化学習(RL)は、事前収集されたデータを用いて、さらなる探索を行わずに学習する。事前のアルゴリズムや分析は、最適なサンプルの複雑さに悩まされるか、サンプルの最適性に到達するために高いバーンインコストがかかるかのいずれかである。モデルベース(あるいは"プラグイン")アプローチは,バーンインコストを伴わずに,最小限のサンプル複雑性を実現することを実証する。
論文参考訳（メタデータ） (2022-04-11T17:26:19Z)
Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model [3.749193647980305]
本稿では,一連の状態対応機能を有するマルコフ決定プロセス(MDP)について考察する。モデルに基づくアプローチ(resp.$Q-learning)が、高い確率で$varepsilon$-Optimalポリシーを確実に学習することを示す。
論文参考訳（メタデータ） (2021-05-28T17:49:39Z)
Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
この研究は、同期Q-ラーニングのサンプルの複雑さを、任意の$0varepsilon 1$に対して$frac|mathcalS| (1-gamma)4varepsilon2$の順序に絞る。計算やストレージを余分に必要とせずに、高速なq-learningにマッチするvanilla q-learningの有効性を明らかにした。
論文参考訳（メタデータ） (2021-02-12T14:22:05Z)
Complexity of zigzag sampling algorithm for strongly log-concave distributions [6.336005544376984]
強いログ凹分布に対するジグザグサンプリングアルゴリズムの計算複雑性について検討する。ジグザグサンプリングアルゴリズムは, 計算コストが$obiglに匹敵するchi-squareの発散において, $varepsilon$ 誤差を達成することを証明した。
論文参考訳（メタデータ） (2020-12-21T03:10:21Z)
Finite-Time Analysis for Double Q-learning [50.50058000948908]
二重Q-ラーニングのための非漸近的有限時間解析を初めて提供する。同期と非同期の二重Q-ラーニングの両方が,グローバル最適化の$epsilon$-accurate近辺に収束することが保証されていることを示す。
論文参考訳（メタデータ） (2020-09-29T18:48:21Z)
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
本稿では、生成モデル(シミュレータ)へのアクセスを想定して、強化学習のサンプル効率について検討する。最初に$gamma$-discounted infinite-horizon Markov decision process (MDPs) with state space $mathcalS$ and action space $mathcalA$を考える。対象の精度を考慮すれば,モデルに基づく計画アルゴリズムが最小限のサンプルの複雑さを実現するのに十分であることを示す。
論文参考訳（メタデータ） (2020-05-26T17:53:18Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。