Fugu-MT 論文翻訳(概要): A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle

論文の概要: A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle

arxiv url: http://arxiv.org/abs/2203.11489v1
Date: Tue, 22 Mar 2022 06:53:42 GMT
ステータス: 翻訳完了
システム内更新日: 2022-03-24 04:15:39.671201
Title: A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle
Title（参考訳）: 生成オラクルを用いた有限mdp解のためのターゲットq-learningについて
Authors: Ziniu Li, Tian Xu, Yang Yu
Abstract要約: We show that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $widetildemathcal O(|mathcal S|2|mathcal A|2 (1-gamma)-5varepsilon-2)$。バニラQ-ラーニングと比較すると、周期的に凍結したターゲットQ-関数の導入は、サンプルの複雑さを犠牲にしない。
参考スコア（独自算出の注目度）: 11.154257789731467
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Q-learning with function approximation could diverge in the off-policy setting and the target network is a powerful technique to address this issue. In this manuscript, we examine the sample complexity of the associated target Q-learning algorithm in the tabular case with a generative oracle. We point out a misleading claim in [Lee and He, 2020] and establish a tight analysis. In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $\widetilde{\mathcal O}(|\mathcal S|^2|\mathcal A|^2 (1-\gamma)^{-5}\varepsilon^{-2})$. Furthermore, we show that this sample complexity is improved to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-5}\varepsilon^{-2})$ if we can sequentially update all state-action pairs and $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-4}\varepsilon^{-2})$ if $\gamma$ is further in $(1/2, 1)$. Compared with the vanilla Q-learning, our results conclude that the introduction of a periodically-frozen target Q-function does not sacrifice the sample complexity.
Abstract（参考訳）: 関数近似によるq-learningは、オフ・ポリシー設定で分岐する可能性があり、ターゲットネットワークはこの問題に対処する強力な技術である。本論文では,対象とするq-learningアルゴリズムのサンプル複雑性を,生成的オラクルを用いた表式ケースで検証する。我々は[Lee and He, 2020]で誤解を招く主張を指摘し、厳密な分析を確立します。特に、[Lee and He, 2020]における対象Q-ラーニングアルゴリズムのサンプル複雑性は、$\widetilde{\mathcal O}(|\mathcal S|^2|\mathcal A|^2 (1-\gamma)^{-5}\varepsilon^{-2})$であることを示す。さらに、このサンプルの複雑さは$\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-5}\varepsilon^{-2})$と$\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-4}\varepsilon^{-2})$に改善される。バニラQ-ラーニングと比較すると、周期的に凍結したターゲットQ-関数の導入は、サンプルの複雑さを犠牲にしない。

関連論文リスト

Improved sampling algorithms and Poincaré inequalities for non-log-concave distributions [1.9753732769115382]
これは$d$と$frac1epsilon$である。 $L=mathcalO(1)$と$M=mathrmpoly(d)$である。
論文参考訳（メタデータ） (2025-07-15T12:06:11Z)
On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries [25.03801392644285]
一般的な製品分布に対するスパース関数のサポートを学習するために,$mathsfDLQ$のクエリ複雑性を厳密に評価する。正方形損失の場合、$mathsfDLQ$は、$mathsfSQ$よりも潜在的にずっと悪い相関統計クエリ$(mathsfCSQ)$-の複雑さと一致する。また、$mathsfDLQ$が2層学習の複雑さを正確に記述することで、(確率的な)勾配勾配で学習をキャプチャできることを示す。
論文参考訳（メタデータ） (2024-07-08T05:30:34Z)
Optimal Sample Complexity for Average Reward Markov Decision Processes [1.0445957451908694]
平均報酬 MDP の最適ポリシを$widetilde O(|S||A|t_textmix2 epsilon-2)$で推定する。これは文学の下位境界に到達した最初のアルゴリズムと分析である。
論文参考訳（メタデータ） (2023-10-13T03:08:59Z)
Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
正方形損失に関して、標準的なガウス分布の下での$k$ReLU活性化の線形結合をPAC学習する問題をmathbbRd$で検討する。本研究の主な成果は,この学習課題に対して,サンプルおよび計算複雑性が$(dk/epsilon)O(k)$で,epsilon>0$が目標精度である。
論文参考訳（メタデータ） (2023-07-24T14:37:22Z)
Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise [50.64137465792738]
この問題に対する効率的なSQアルゴリズムは、少なくとも$Omega(d1/2/(maxp, epsilon)2)$. のサンプル複雑性を必要とする。我々の下限は、この1/epsilon$に対する二次的依存は、効率的なアルゴリズムに固有のものであることを示唆している。
論文参考訳（メタデータ） (2023-07-13T18:59:28Z)
Information-Computation Tradeoffs for Learning Margin Halfspaces with Random Classification Noise [50.64137465792738]
ランダム分類ノイズを用いたPAC$gamma$-marginハーフスペースの問題について検討する。我々は、問題のサンプル複雑性と計算効率の良いアルゴリズムのサンプル複雑性との間に固有のギャップを示唆する情報計算トレードオフを確立する。
論文参考訳（メタデータ） (2023-06-28T16:33:39Z)
SQ Lower Bounds for Learning Bounded Covariance GMMs [46.289382906761304]
P= sum_i=1k w_i MathcalN(boldsymbol mu_i,mathbf Sigma_i)$ という形で、分離されたガウスの混合を $mathbbRd$ で学習することに焦点を当てる。この問題に対する統計的クエリ(SQ)アルゴリズムは、少なくともdOmega (1/epsilon)$の複雑さを必要とすることを証明している。
論文参考訳（メタデータ） (2023-06-22T17:23:36Z)
Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
モノトン活性化に対する $mathbfxmapstosigma(mathbfwcdotmathbfx)$ の関数について検討する。学習者の目標は仮説ベクトル $mathbfw$ that $F(mathbbw)=C, epsilon$ を高い確率で出力することである。
論文参考訳（メタデータ） (2022-06-17T17:55:43Z)
Statistically Near-Optimal Hypothesis Selection [33.83129262033921]
仮説選択問題に対する最適2ドルの近似学習戦略を導出する。これは、最適な近似係数とサンプルの複雑さを同時に達成する最初のアルゴリズムである。
論文参考訳（メタデータ） (2021-08-17T21:11:20Z)
Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
この研究は、同期Q-ラーニングのサンプルの複雑さを、任意の$0varepsilon 1$に対して$frac|mathcalS| (1-gamma)4varepsilon2$の順序に絞る。計算やストレージを余分に必要とせずに、高速なq-learningにマッチするvanilla q-learningの有効性を明らかにした。
論文参考訳（メタデータ） (2021-02-12T14:22:05Z)
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
S$状態、$A$アクション、割引係数$gamma in (0,1)$、近似しきい値$epsilon > 0$の MDP が与えられた場合、$epsilon$-Optimal Policy を学ぶためのモデルなしアルゴリズムを提供する。十分小さな$epsilon$の場合、サンプルの複雑さで改良されたアルゴリズムを示す。
論文参考訳（メタデータ） (2020-06-06T13:34:41Z)
Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
非同期Q-ラーニングはマルコフ決定過程(MDP)の最適行動値関数(またはQ-関数)を学習することを目的としている。 Q-関数の入出力$varepsilon$-正確な推定に必要なサンプルの数は、少なくとも$frac1mu_min (1-gamma)5varepsilon2+ fract_mixmu_min (1-gamma)$の順である。
論文参考訳（メタデータ） (2020-06-04T17:51:00Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。