Fugu-MT 論文翻訳(概要): Finite-Time Analysis for Double Q-learning

論文の概要: Finite-Time Analysis for Double Q-learning

arxiv url: http://arxiv.org/abs/2009.14257v2
Date: Mon, 12 Oct 2020 14:13:15 GMT
ステータス: 翻訳完了
システム内更新日: 2022-10-13 06:03:03.523382
Title: Finite-Time Analysis for Double Q-learning
Title（参考訳）: ダブルq学習のための有限時間解析
Authors: Huaqing Xiong, Lin Zhao, Yingbin Liang, Wei Zhang
Abstract要約: 二重Q-ラーニングのための非漸近的有限時間解析を初めて提供する。同期と非同期の二重Q-ラーニングの両方が,グローバル最適化の$epsilon$-accurate近辺に収束することが保証されていることを示す。
参考スコア（独自算出の注目度）: 50.50058000948908
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in~\citet{hasselt2010double} overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\left(\left( \frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}} +\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where $\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.
Abstract（参考訳）: q-learningは強化学習において最善のアクション値関数(つまり最適なポリシー)を見つけるための最も成功したアルゴリズムの1つであるが、その実装はしばしばランダムサンプリングによって生じるq-関数値の過大評価に苦しむ。 The double Q-learning algorithm proposed in~\citet{hasselt2010double} は、2つのQ-estimator間で更新をランダムに切り替えることでそのような過大評価問題を克服し、実際にかなりの人気を得た。しかし、二重Q学習の理論的理解は限られている。これまでのところ、アルゴリズムの収束速度を特徴づけない漸近収束のみが確立されている。本稿では,二重q学習のための非漸近的(すなわち有限時間)解析を初めて提供する。同期および非同期の二重q-ラーニングは、$\tilde{\omega}\left(\left(\left( \frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}} +\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$イテレーション、ここで$\omega\in(0,1)$は学習率の減衰パラメータ、$\gamma$は割引係数である。本解析は, 2つの連結確率過程間の差分に関する有限時間境界を導出する新しい手法を開発し, 確率近似の文献に新たな知見を与える。

関連論文リスト

Finite-Time Bounds for Two-Time-Scale Stochastic Approximation with Arbitrary Norm Contractions and Markovian Noise [7.770605097524015]
2時間スケール近似(英: Two-time-scale Approximation、SA)は、強化学習と最適化に応用した反復アルゴリズムである。強化学習の応用により、非線型2時間スケール SA 上の最初の平均正方形を与える。
論文参考訳（メタデータ） (2025-03-24T07:03:23Z)
Two-Step Q-Learning [0.0]
そこで本研究では,重要でない2段階のQ-ラーニングアルゴリズムを提案する。数値実験により、2段階のQ-ラーニングとそのスムーズな変形の優れた性能が示された。
論文参考訳（メタデータ） (2024-07-02T15:39:00Z)
Finite-Time Error Bounds for Greedy-GQ [20.51105692499517]
We show that Greedy-GQ algorithm converges fast-time error。我々の分析は、ステップサイズを選択するために、より高速な収束ステップサイズを提供する。
論文参考訳（メタデータ） (2022-09-06T15:04:57Z)
Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning [77.22019100456595]
通信周波数の異なる分散計算作業者のトレーニングアルゴリズムを示す。本研究では,より厳密な収束率を$mathcalO!!(sigma2-2_avg!)とする。また,不均一性の項は,作業者の平均遅延によっても影響されることを示した。
論文参考訳（メタデータ） (2022-06-16T17:10:57Z)
Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
この研究は、同期Q-ラーニングのサンプルの複雑さを、任意の$0varepsilon 1$に対して$frac|mathcalS| (1-gamma)4varepsilon2$の順序に絞る。計算やストレージを余分に必要とせずに、高速なq-learningにマッチするvanilla q-learningの有効性を明らかにした。
論文参考訳（メタデータ） (2021-02-12T14:22:05Z)
Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
本稿では,各ステップごとに1つのデータポイントしか必要としない2つの単一スケールシングルループアルゴリズムを提案する。本研究の結果は, 同時一次および二重側収束の形で表される。
論文参考訳（メタデータ） (2020-08-23T20:36:49Z)
Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
非同期Q-ラーニングはマルコフ決定過程(MDP)の最適行動値関数(またはQ-関数)を学習することを目的としている。 Q-関数の入出力$varepsilon$-正確な推定に必要なサンプルの数は、少なくとも$frac1mu_min (1-gamma)5varepsilon2+ fract_mixmu_min (1-gamma)$の順である。
論文参考訳（メタデータ） (2020-06-04T17:51:00Z)
Finite-Time Last-Iterate Convergence for Multi-Agent Learning in Games [116.0771177871705]
我々は,$lambda$-cocoerciveゲーム上での連立OGD学習における有限時間最終点収束率を特徴付ける。新たなダブルストッピング時間法により, この適応アルゴリズムは, 非適応的手法と同じ有限時間終点収束率が得られることを示す。
論文参考訳（メタデータ） (2020-02-23T01:46:34Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。