Fugu-MT Paper Translation (Abstract): Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

Paper abstract: Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

arxiv url: http://arxiv.org/abs/2211.03899v1
Date: Mon, 7 Nov 2022 23:15:25 GMT
Status: Translation complete
Last updated in system: 2022-11-09 15:52:57.459404
Title: Policy evaluation from a single path: Multi-step methods, mixing and mis-specification
Title (reference translation): Policy evaluation from a single path: multi-step methods, mixing, and mis-specification
Authors: Yaqi Duan, Martin J. Wainwright
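Abstract summary: We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference estimates.
Reference score (site-computed attention measure): 45.88067550131531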
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds capture its dependence on Bellman fluctuations, mixing time of the Markov chain, any mis-specification in the model, as well as the choice of weight function defining the estimator itself, and reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, its statistical error under trajectory data is similar to that of i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in data inflates the statistical error. However, any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds by proving minimax lower bounds that establish optimality of TD-based methods with appropriately chosen look-ahead and weighting, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.
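Abstract (reference translation): We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process (MRP) from observations along a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, which includes $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds depend on the Bellman fluctuations, the mixing time of the Markov chain, any mis-specification in the model, and the choice of weight function defining the estimator itself, and they reveal delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, the statistical error under trajectory data is similar to that under i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in the data inflates the statistical error. However, such deterioration can be mitigated by increasing the look-ahead. We complement the upper bounds with minimax lower bounds that establish the optimality of TD-based methods with appropriately chosen look-ahead and weighting, and that reveal fundamental differences between value function estimation and ordinary non-parametric regression.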
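
To make the $K$-step look-ahead TD idea in the abstract concrete, here is a minimal tabular sketch in Python. It is not the paper's kernel-based estimator: the function name `k_step_td`, the constant step size, and the two-state toy chain are all assumptions made for illustration. Each update moves the value of the current state toward the discounted sum of the next $K$ rewards plus the discounted value estimate $K$ steps ahead, using only one trajectory.

```python
def k_step_td(states, rewards, num_states, gamma, K, step_size=0.5):
    """Tabular K-step look-ahead TD along a single trajectory.

    states[t] is the state visited at time t and rewards[t] is the reward
    received there. Textbook-style sketch under assumed names; it is not
    the kernel-based estimator analyzed in the paper.
    """
    values = [0.0] * num_states
    # Only times t with a full K-step look-ahead window are used.
    for t in range(len(states) - K):
        # Discounted sum of the next K rewards ...
        target = sum(gamma ** k * rewards[t + k] for k in range(K))
        # ... plus the bootstrapped value estimate K steps ahead.
        target += gamma ** K * values[states[t + K]]
        values[states[t]] += step_size * (target - values[states[t]])
    return values


# Hypothetical toy MRP: two states visited alternately, reward 1 in state 0.
path = [t % 2 for t in range(2000)]
path_rewards = [1.0 if s == 0 else 0.0 for s in path]
estimate = k_step_td(path, path_rewards, num_states=2, gamma=0.5, K=2)
print(estimate)
```

For this toy chain the Bellman equations $V(0) = 1 + 0.5\,V(1)$ and $V(1) = 0.5\,V(0)$ give $V(0) = 4/3$ and $V(1) = 2/3$, so the printed estimates can be compared against those values. Because the chain is deterministic, the sketch says nothing about the mixing-time or mis-specification effects the abstract describes; it only shows the form of the look-ahead update.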

Related papers

Statistical guarantees for continuous-time policy evaluation: blessing of ellipticity and new tradeoffs [2.926192989090622]
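We study estimation of the value function for continuous-time Markov diffusion processes. Our work provides non-asymptotic statistical guarantees for the least-squares temporal-difference method.
Paper reference translation (metadata) (2025-02-06T18:39:03Z)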
Markov Chain Variance Estimation: A Stochastic Approximation Approach [14.883782513177094]
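Estimating the variance of a function defined on a Markov chain is an important step in statistical inference for the stationary mean. We design a novel recursive estimator that requires $O(1)$ computation at each step, needs neither past samples nor knowledge of the run length, and attains the optimal $O(\frac{1}{n})$ rate of convergence for the mean-squared error (MSE) with provable finite-sample guarantees.
Paper reference translation (metadata) (2024-09-09T15:42:28Z)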
Multivariate root-n-consistent smoothing parameter free matching estimators and estimators of inverse density weighted expectations [51.000851088730684]
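We develop novel modifications of nearest-neighbor and matching estimators that converge at the parametric $\sqrt{n}$-rate. We stress that our estimators do not involve non-parametric function estimators and, in particular, do not rely on sample-size-dependent smoothing parameters.
Paper reference translation (metadata) (2024-07-11T13:28:34Z)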
Statistical Efficiency of Distributional Temporal Difference Learning and Freedman's Inequality in Hilbert Spaces [24.03281329962804]
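This paper focuses on non-asymptotic statistical rates of distributional temporal difference learning. For NTD with a generative model, a sample complexity bound of $\tilde{O}(\varepsilon^{-2} \mu_{\pi,\min}^{-1} (1-\gamma)^{-3} + t_{mix} \mu_{\pi,\min}^{-1} (1-\gamma)^{-1})$ is needed when the error is measured in the $1$-Wasserstein distance. We establish a novel Freedman's inequality.
Paper reference translation (metadata) (2024-03-09T06:19:53Z)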
Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency [53.90687548731265]
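We study optimal procedures for estimating a linear functional based on observational data. For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
Paper reference translation (metadata) (2023-01-16T02:57:37Z)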
On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation [1.575865518040625]
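We study the off-policy evaluation problem in an infinite-horizon Markov decision process with continuous states and actions. We recast $Q$-function estimation as a special form of the nonparametric instrumental variables (NPIV) estimation problem.
Paper reference translation (metadata) (2022-01-17T01:09:38Z)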
Optimal and instance-dependent guarantees for Markovian linear stochastic approximation [47.912511426974376]
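We show a non-asymptotic bound of order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme. We derive corollaries of these results for policy evaluation with Markov noise.
Paper reference translation (metadata) (2021-12-23T18:47:50Z)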
Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
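We estimate the value function of an infinite-horizon discounted Markov reward process using reproducing kernel Hilbert spaces. We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator. We prove minimax lower bounds over sub-classes of MRPs.
Paper reference translation (metadata) (2021-09-24T14:48:20Z)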
Improved Prediction and Network Estimation Using the Monotone Single Index Multi-variate Autoregressive Model [34.529641317832024]
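We develop a semi-parametric approach based on the monotone single-index multi-variate autoregressive model (SIMAM). We provide theoretical guarantees for dependent data and an alternating projected gradient descent algorithm. We demonstrate superior performance on simulated data and two real data examples.
Paper reference translation (metadata) (2021-06-28T12:32:29Z)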
Estimation in Tensor Ising Models [5.161531917413708]
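We consider the problem of estimating the natural parameter of the $p$-tensor Ising model given a single sample from the distribution on $N$ nodes. In particular, this includes $\sqrt{N}$-consistency of the MPL estimate in the $p$-spin Sherrington-Kirkpatrick (SK) model. We derive the exact fluctuations of the MPL estimate in the special case of the $p$-tensor Curie-Weiss model.
Paper reference translation (metadata) (2020-08-29T00:06:58Z)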
SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models [80.22609163316459]
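We introduce an unbiased estimator of the log marginal probability, and of its gradients, for latent variable models based on randomized truncation of infinite series. We show that models trained using our estimator give better test-set likelihoods than a standard importance-sampling-based approach for the same average computational cost.
Paper reference translation (metadata) (2020-04-01T11:49:30Z)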

The related-papers list is generated automatically from the titles and abstracts of papers on this site.
