Fugu-MT 論文翻訳(概要): Minimax Optimal $Q$ Learning with Nearest Neighbors

論文の概要: Minimax Optimal $Q$ Learning with Nearest Neighbors

arxiv url: http://arxiv.org/abs/2308.01490v1
Date: Thu, 3 Aug 2023 01:08:53 GMT
ステータス: 翻訳完了
システム内更新日: 2023-08-04 15:37:54.263634
Title: Minimax Optimal $Q$ Learning with Nearest Neighbors
Title（参考訳）: minimax optimal $q$ learning with nearest neighbors (特集ミニマックス)
Authors: Puning Zhao, Lifeng Lai
Abstract要約: 推定$Q$関数の収束率は$tildeO(T-1/(d+3))$であり、$tildeOmega(T-1/(d+2))$よりも遅い。本稿では,収束率のギャップを埋める2つの新しいQ$学習法を提案し,そのうちの1つはオフラインで,もう1つはオンラインである。
参考スコア（独自算出の注目度）: 56.29748742084386
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: $Q$ learning is a popular model free reinforcement learning method. Most of existing works focus on analyzing $Q$ learning for finite state and action spaces. If the state space is continuous, then the original $Q$ learning method can not be directly used. A modification of the original $Q$ learning method was proposed in (Shah and Xie, 2018), which estimates $Q$ values with nearest neighbors. Such modification makes $Q$ learning suitable for continuous state space. (Shah and Xie, 2018) shows that the convergence rate of estimated $Q$ function is $\tilde{O}(T^{-1/(d+3)})$, which is slower than the minimax lower bound $\tilde{\Omega}(T^{-1/(d+2)})$, indicating that this method is not efficient. This paper proposes two new $Q$ learning methods to bridge the gap of convergence rates in (Shah and Xie, 2018), with one of them being offline, while the other is online. Despite that we still use nearest neighbor approach to estimate $Q$ function, the algorithms are crucially different from (Shah and Xie, 2018). In particular, we replace the kernel nearest neighbor in discretized region with a direct nearest neighbor approach. Consequently, our approach significantly improves the convergence rate. Moreover, the time complexity is also significantly improved in high dimensional state spaces. Our analysis shows that both offline and online methods are minimax rate optimal.
Abstract（参考訳）: q$ learningは人気のあるモデルフリーの強化学習方法だ。既存の作業の多くは、有限状態およびアクション空間に対する$Q$学習の分析に重点を置いている。状態空間が連続であれば、元の$Q$学習メソッドを直接使用することはできない。元の$q$学習法の修正が (shah and xie, 2018) で提案され、これは$q$の値が近辺の値と推定される。このような修正により、連続状態空間に適した学習が$Q$になる。 (Shah and Xie, 2018) は、推定$Q$関数の収束速度が $\tilde{O}(T^{-1/(d+3)})$ であることを示し、これはミニマックス下界$\tilde{\Omega}(T^{-1/(d+2)})$ よりも遅く、この方法が効率的でないことを示している。本稿では,収束率のギャップを埋める2つの新しいQ$学習法(Shah and Xie, 2018)を提案し,そのうちの1つはオフラインであり,もう1つはオンラインである。 Q$関数を推定するために、近傍のアプローチは依然として使われているが、アルゴリズムは(Shah and Xie, 2018)非常に異なる。特に、離散化された領域におけるカーネル近傍のアプローチを、直接近接したアプローチで置き換える。したがって,本手法は収束率を大幅に向上させる。さらに、高次元状態空間においても時間複雑性が著しく改善される。分析の結果,オフラインとオンラインの両方の手法が最適であることがわかった。

関連論文リスト

Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error [25.777423855881878]
我々は、$Oleft(Hepsilonright)$-optimal Policyを得ることができることを示す新しい除去アルゴリズムを示す。我々は上界を$widetildeOmegaleft(Hepsilonright)$-optimality lower boundで補い、この問題の完全な図面を与える。
論文参考訳（メタデータ） (2024-07-18T15:58:04Z)
Truncated Variance Reduced Value Iteration [23.282578280033622]
我々は、割引マルコフ決定プロセスにおいて、$epsilon$-optimal Policyを計算するための高速なランダム化アルゴリズムを提供する。この結果から,モデルフリー法とモデルベース法とでは,サンプル・複雑さのギャップを埋めることができた。
論文参考訳（メタデータ） (2024-05-21T17:28:06Z)
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
本稿では,円滑なベルマン作用素を持つ連続空間マルコフ決定過程(MDP)の一般クラスにおいて,$varepsilon$-optimal Policyを学習する問題を考察する。我々のソリューションの鍵となるのは、調和解析のアイデアに基づく新しい射影技術である。我々の結果は、連続空間 MDP における2つの人気と矛盾する視点のギャップを埋めるものである。
論文参考訳（メタデータ） (2024-05-10T09:58:47Z)
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
我々はマルコフ決定過程(MDPs)のための証明可能なモデルフリー強化学習(RL)アルゴリズムを開発した。シミュレータ設定では,$widetildeO left(fracSAmathrmsp(h*)epsilon2+fracS2Amathrmsp(h*)epsilon2right)$サンプルを用いて,$epsilon$-optimal Policyを求める。
論文参考訳（メタデータ） (2023-06-28T17:43:19Z)
Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
この研究は、平均報酬マルコフ決定過程(AMDP)における$varepsilon$-Optimal Policyを得る際のサンプルの複雑さを考察する。我々は、状態-作用対当たりの$widetilde O(H varepsilon-3 ln frac1delta)$サンプルを証明し、$H := sp(h*)$は任意の最適ポリシーのバイアスのスパンであり、$varepsilon$は精度、$delta$は失敗確率である。
論文参考訳（メタデータ） (2022-12-01T15:57:58Z)
Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
報酬混合マルコフ決定過程(RMMDP)におけるエピソード強化学習の検討我々のゴールは、そのようなモデルにおける時間段階の累積報酬をほぼ最大化する、ほぼ最適に近いポリシーを学ぶことである。
論文参考訳（メタデータ） (2022-10-05T22:52:00Z)
Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
オフライン強化学習(RL)は、事前収集されたデータを用いて、さらなる探索を行わずに学習する。事前のアルゴリズムや分析は、最適なサンプルの複雑さに悩まされるか、サンプルの最適性に到達するために高いバーンインコストがかかるかのいずれかである。モデルベース(あるいは"プラグイン")アプローチは,バーンインコストを伴わずに,最小限のサンプル複雑性を実現することを実証する。
論文参考訳（メタデータ） (2022-04-11T17:26:19Z)
Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
S$状態、$A$アクション、計画的地平$H$で、エピソードな時間同質なMarkov決定プロセスに関するオフライン強化学習を再考する。経験的MDPを用いた評価と計画のための,約$H$自由なサンプル複雑性境界の最初の集合を得る。
論文参考訳（メタデータ） (2021-03-25T18:52:17Z)
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
S$状態、$A$アクション、割引係数$gamma in (0,1)$、近似しきい値$epsilon > 0$の MDP が与えられた場合、$epsilon$-Optimal Policy を学ぶためのモデルなしアルゴリズムを提供する。十分小さな$epsilon$の場合、サンプルの複雑さで改良されたアルゴリズムを示す。
論文参考訳（メタデータ） (2020-06-06T13:34:41Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。