Fugu-MT 論文翻訳(概要): Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces

論文の概要: Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces

arxiv url: http://arxiv.org/abs/2403.05811v2
Date: Thu, 14 Mar 2024 09:24:51 GMT
ステータス: 翻訳完了
システム内更新日: 2024-03-16 01:11:34.042622
Title: Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces
Title（参考訳）: ヒルベルト空間における極小分布時間差アルゴリズムと自由マン不等式
Authors: Yang Peng, Liangyu Zhang, Zhihua Zhang,
Abstract要約: 我々は,非パラメトリック分布型TDアルゴリズム (NTD) を$gamma$-discounted infinite-horizon Markov決定プロセスに対して提案する。我々はヒルベルト空間における新しいフリードマンの不等式を確立し、これは独立な関心事である。
参考スコア（独自算出の注目度）: 24.03281329962804
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One of the core tasks in the field of DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. The distributional temporal difference (TD) algorithm has been accordingly proposed, which is an extension of the temporal difference algorithm in the classic RL literature. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference algorithm (CTD) and quantile temporal difference algorithm (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose a non-parametric distributional TD algorithm (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that for NTD we need $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal (up to logarithmic factors) in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance.
Abstract（参考訳）: 分散強化学習(DRL)は様々な領域で実証的な成功を収めている。 DRL の分野における中核的なタスクの1つは、あるポリシーに対する戻り分布 $\eta^\pi$ を推定する分散ポリシー評価である。従来のRL文献における時間差分法の拡張である分布時間差分法(TD)アルゴリズムが提案されている。表の例では、 \citet{rowland2018analysis} と \citet{rowland2023analysis} は、分布的TDの2つの例、すなわち、カテゴリー的時間差アルゴリズム (CTD) と量子的時間差アルゴリズム (QTD) の漸近収束をそれぞれ証明した。本稿では、さらに一歩進んで、分布性TDの有限サンプル性能を解析する。理論解析を容易にするために,非パラメトリック分布型TDアルゴリズム(NTD)を提案する。 $\gamma$-discounted infinite-horizon tabular Markov decision processでは、NTD に対して$\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve a $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p-Wasserstein distance。このサンプル複雑性境界は、ワッサーシュタイン距離が1ドルである場合、最小最大(対数因子まで)である。これを達成するために、ヒルベルト空間における新しいフリードマンの不等式(英語版)(Freedman's inequality)を確立する。さらに我々はCTDを再検討し,同じ非漸近収束境界が$p$-Wasserstein距離の場合,CTDに対して成り立つことを示した。

関連論文リスト

A Sharp Convergence Theory for The Probability Flow ODEs of Diffusion Models [45.60426164657739]
拡散型サンプリング器の非漸近収束理論を開発する。我々は、$d/varepsilon$がターゲット分布を$varepsilon$トータル偏差距離に近似するのに十分であることを証明した。我々の結果は、$ell$のスコア推定誤差がデータ生成プロセスの品質にどのように影響するかも特徴付ける。
論文参考訳（メタデータ） (2024-08-05T09:02:24Z)
Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
割引報酬マルコフ決定プロセスにおける分散政策評価の問題点を考察する。本稿では,線形関数近似(LFA)を用いた時間差分型学習アルゴリズムについて述べる。平均二乗の意味で(i) を保持する有限標本境界と、(ii) テールイテレート平均化を用いる場合の高い確率を導出する。
論文参考訳（メタデータ） (2024-06-12T05:49:53Z)
Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
近年の研究では、再生カーネルヒルベルト空間(RKHS)がニューラルネットワークによる関数のモデル化に適した空間ではないことが示されている。本稿では,有界ノルムを持つオーバーパラメータ化された2層ニューラルネットワークに適した関数空間について検討する。
論文参考訳（メタデータ） (2024-04-29T15:04:07Z)
Convergence Analysis of Probability Flow ODE for Score-based Generative Models [5.939858158928473]
確率フローODEに基づく決定論的サンプリング器の収束特性を理論的・数値的両面から検討する。連続時間レベルでは、ターゲットと生成されたデータ分布の総変動を$mathcalO(d3/4delta1/2)$で表すことができる。
論文参考訳（メタデータ） (2024-04-15T12:29:28Z)
Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
バーンインコストを発生させることなく、最小限の最適後悔を実現する方法を示す。最適値/コストや一定の分散といった問題依存量の影響を明らかにするために、我々の理論を拡張します。
論文参考訳（メタデータ） (2023-07-25T15:42:11Z)
Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards [27.209606183563853]
動的勾配クリッピング機構による時間差(TD)学習は,重み付き報酬分布に対して確実に堅牢化できることを確認した。 TD学習に基づくNACの頑健な変種が$tildemathcalO(varepsilon-frac1p)$サンプル複雑性を達成することを示す。
論文参考訳（メタデータ） (2023-06-20T11:12:21Z)
Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative Models [49.81937966106691]
我々は拡散モデルのデータ生成過程を理解するための非漸近理論のスイートを開発する。従来の研究とは対照的に,本理論は基本的だが多目的な非漸近的アプローチに基づいて開発されている。
論文参考訳（メタデータ） (2023-06-15T16:30:08Z)
Policy evaluation from a single path: Multi-step methods, mixing and mis-specification [45.88067550131531]
無限水平$gamma$-discounted Markov rewardプロセスの値関数の非パラメトリック推定について検討した。カーネルベースの多段階時間差推定の一般的なファミリーに対して、漸近的でない保証を提供する。
論文参考訳（メタデータ） (2022-11-07T23:15:25Z)
Improved Analysis of Score-based Generative Modeling: User-Friendly Bounds under Minimal Smoothness Assumptions [9.953088581242845]
2次モーメントを持つ任意のデータ分布に対して,コンバージェンス保証と複雑性を提供する。我々の結果は、対数共空性や機能的不等式を前提としない。我々の理論解析は、異なる離散近似の比較を提供し、実際の離散化点の選択を導くかもしれない。
論文参考訳（メタデータ） (2022-11-03T15:51:00Z)
Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
オフライン強化学習(RL)は、事前収集されたデータを用いて、さらなる探索を行わずに学習する。事前のアルゴリズムや分析は、最適なサンプルの複雑さに悩まされるか、サンプルの最適性に到達するために高いバーンインコストがかかるかのいずれかである。モデルベース(あるいは"プラグイン")アプローチは,バーンインコストを伴わずに,最小限のサンプル複雑性を実現することを実証する。
論文参考訳（メタデータ） (2022-04-11T17:26:19Z)
High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize [55.0090961425708]
本研究では,AdaGradのスムーズな非確率問題に対する簡易な高確率解析法を提案する。我々はモジュラーな方法で解析を行い、決定論的設定において相補的な$mathcal O (1 / TT)$収束率を得る。我々の知る限りでは、これは真に適応的なスキームを持つAdaGradにとって初めての高い確率である。
論文参考訳（メタデータ） (2022-04-06T13:50:33Z)
Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
カーネルヒルベルト空間を用いて、無限水平割引マルコフ報酬過程の値関数を推定する。我々は、関連するカーネル演算子の固有値に明示的に依存した誤差の非漸近上界を導出する。 MRP のサブクラスに対する minimax の下位境界を証明する。
論文参考訳（メタデータ） (2021-09-24T14:48:20Z)
Limit Distribution Theory for the Smooth 1-Wasserstein Distance with Applications [18.618590805279187]
スムーズな1-ワッサーシュタイン距離 (SWD) $W_1sigma$ は経験的近似における次元の呪いを軽減する手段として最近提案された。この研究は、高次元の極限分布結果を含むSWDの詳細な統計的研究を行う。
論文参考訳（メタデータ） (2021-07-28T17:02:24Z)
Wasserstein distance estimates for the distributions of numerical approximations to ergodic stochastic differential equations [0.3553493344868413]
エルゴード微分方程式のイン分布と強い対数凸の場合の分布との間のワッサースタイン距離について検討した。これにより、過減衰および過減衰ランジュバン力学の文献で提案されている多くの異なる近似を統一的に研究することができる。
論文参考訳（メタデータ） (2021-04-26T07:50:04Z)
Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
この研究は、同期Q-ラーニングのサンプルの複雑さを、任意の$0varepsilon 1$に対して$frac|mathcalS| (1-gamma)4varepsilon2$の順序に絞る。計算やストレージを余分に必要とせずに、高速なq-learningにマッチするvanilla q-learningの有効性を明らかにした。
論文参考訳（メタデータ） (2021-02-12T14:22:05Z)
Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling [110.88857917726276]
我々は,非log-concaveとなる分布のクラスからサンプリングするために,勾配ランゲヴィンダイナミクス(SGLD)の新たな収束解析を行う。我々のアプローチの核心は、補助的時間反転型マルコフ連鎖を用いたSGLDのコンダクタンス解析である。
論文参考訳（メタデータ） (2020-10-19T15:23:18Z)
Finite-Time Analysis for Double Q-learning [50.50058000948908]
二重Q-ラーニングのための非漸近的有限時間解析を初めて提供する。同期と非同期の二重Q-ラーニングの両方が,グローバル最適化の$epsilon$-accurate近辺に収束することが保証されていることを示す。
論文参考訳（メタデータ） (2020-09-29T18:48:21Z)
Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
非同期Q-ラーニングはマルコフ決定過程(MDP)の最適行動値関数(またはQ-関数)を学習することを目的としている。 Q-関数の入出力$varepsilon$-正確な推定に必要なサンプルの数は、少なくとも$frac1mu_min (1-gamma)5varepsilon2+ fract_mixmu_min (1-gamma)$の順である。
論文参考訳（メタデータ） (2020-06-04T17:51:00Z)
Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling [56.394284787780364]
本稿では、ポリシー勾配(PG)と時間差(TD)学習の2つの基本RLアルゴリズムに対して、最初の理論的収束解析を行う。一般の非線形関数近似の下では、PG-AMSGradは定常点の近傍に収束し、$mathcalO(log T/sqrtT)$である。線形関数近似の下では、一定段階のTD-AMSGradは$mathcalO(log T/sqrtT)の速度で大域的最適化の近傍に収束する。
論文参考訳（メタデータ） (2020-02-15T00:26:49Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。