Fugu-MT 論文翻訳(概要): Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error

論文の概要: Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error

arxiv url: http://arxiv.org/abs/2407.13622v1
Date: Thu, 18 Jul 2024 15:58:04 GMT
ステータス: 翻訳完了
システム内更新日: 2024-07-19 14:51:10.990110
Title: Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error
Title（参考訳）: Sparse Linear Function Approximation による不特定$Q$-earning: 近似誤差のタイト境界
Authors: Ally Yalei Du, Lin F. Yang, Ruosong Wang,
Abstract要約: 我々は、$Oleft(Hepsilonright)$-optimal Policyを得ることができることを示す新しい除去アルゴリズムを示す。我々は上界を$widetildeOmegaleft(Hepsilonright)$-optimality lower boundで補い、この問題の完全な図面を与える。
参考スコア（独自算出の注目度）: 25.777423855881878
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent work by Dong & Yang (2023) showed for misspecified sparse linear bandits, one can obtain an $O\left(\epsilon\right)$-optimal policy using a polynomial number of samples when the sparsity is a constant, where $\epsilon$ is the misspecification error. This result is in sharp contrast to misspecified linear bandits without sparsity, which require an exponential number of samples to get the same guarantee. In order to study whether the analog result is possible in the reinforcement learning setting, we consider the following problem: assuming the optimal $Q$-function is a $d$-dimensional linear function with sparsity $k$ and misspecification error $\epsilon$, whether we can obtain an $O\left(\epsilon\right)$-optimal policy using number of samples polynomially in the feature dimension $d$. We first demonstrate why the standard approach based on Bellman backup or the existing optimistic value function elimination approach such as OLIVE (Jiang et al., 2017) achieves suboptimal guarantees for this problem. We then design a novel elimination-based algorithm to show one can obtain an $O\left(H\epsilon\right)$-optimal policy with sample complexity polynomially in the feature dimension $d$ and planning horizon $H$. Lastly, we complement our upper bound with an $\widetilde{\Omega}\left(H\epsilon\right)$ suboptimality lower bound, giving a complete picture of this problem.
Abstract（参考訳）: Dong & Yang (2023) による最近の研究は、不特定なスパース線形包帯に対して、空間が定数であるときにサンプルの多項式数を用いて$O\left(\epsilon\right)$-最適化ポリシーを得ることができ、そこで$\epsilon$は不特定誤差である。この結果は、スパーシティーのない不特定線形バンドレットと鋭い対比であり、同じ保証を得るためには指数的な数のサンプルを必要とする。強化学習環境でアナログ結果が可能であるかどうかを調べるために、最適$Q$-函数がスパーシティ$k$と不特定誤差$\epsilon$を持つ$d$次元線型関数であると仮定すると、特徴次元$d$に多項式的なサンプル数を用いた$O\left(\epsilon\right)$-最適ポリシーが得られるかどうかという問題を考える。まず,ベルマンバックアップに基づく標準手法やOLIVE (Jiang et al , 2017) のような既存の楽観的値関数除去手法が,この問題に対する準最適保証を実現する理由を示す。そこで我々は,特徴次元$d$と計画的地平$H$の多項式を持つ$O\left(H\epsilon\right)$-optimal Policyが得られることを示す新しい除去アルゴリズムを設計する。最後に、上界を$\widetilde{\Omega}\left(H\epsilon\right)$ suboptimality lower boundで補い、この問題の完全な図面を与える。

関連論文リスト

On the Interplay Between Misspecification and Sub-optimality Gap in Linear Contextual Bandits [76.2262680277608]
本研究では,線形関数クラスによって期待される報酬関数を近似できるような,不特定条件下での線形文脈帯域について検討する。このアルゴリズムは, 対数的因子に比例した設定において, ギャップ依存の残差が$tilde O (d2/Delta)$と同じであることを示す。
論文参考訳（メタデータ） (2023-03-16T15:24:29Z)
Best Policy Identification in Linear MDPs [70.57916977441262]
縮退した線形マルコフ+デルタ決定における最適同定問題について, 生成モデルに基づく固定信頼度設定における検討を行った。複雑な非最適化プログラムの解としての下位境界は、そのようなアルゴリズムを考案する出発点として用いられる。
論文参考訳（メタデータ） (2022-08-11T04:12:50Z)
Sampling from Log-Concave Distributions with Infinity-Distance Guarantees and Applications to Differentially Private Optimization [33.38289436686841]
本稿では,dis distributionO(varepsilon)$close から$ infinity-distance に点を出力するアルゴリズムを提案する。また、ディキンウォークの「ソフトパイ」バージョンも提示する。
論文参考訳（メタデータ） (2021-11-07T13:44:50Z)
Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm [46.36534144138337]
本稿では,オフライン強化学習におけるポリシー評価のサンプル複雑さについて検討する。低分布シフトの仮定の下では、最大$oleft(maxleft fracleftvert thetapirightvert _24varepsilon4logfracddelta,frac1varepsilon2left(d+logfrac1deltaright)right right)$サンプルを必要とするアルゴリズムがあることを示す。
論文参考訳（メタデータ） (2021-03-17T18:18:57Z)
Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$ Geometry [69.24618367447101]
対数要因まで $(varepsilon,delta)$-differently private の最適過剰人口損失は $sqrtlog(d)/n + sqrtd/varepsilon n.$ です。損失関数がさらなる滑らかさの仮定を満たすとき、余剰損失は$sqrtlog(d)/n + (log(d)/varepsilon n)2/3で上界(対数因子まで)であることが示される。
論文参考訳（メタデータ） (2021-03-02T06:53:44Z)
Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
学習者が生成モデル$Y = langle X,w* rangle + epsilon$から$n$のサンプルにアクセスできるような高次元頑健な線形回帰問題について検討する。 i) $X$ is L4-L2 hypercontractive, $mathbbE [XXtop]$ has bounded condition number and $epsilon$ has bounded variance, (ii) $X$ is sub-Gaussian with identity second moment and $epsilon$ is
論文参考訳（メタデータ） (2020-07-16T06:44:44Z)
Sparse Convex Optimization via Adaptively Regularized Hard Thresholding [17.60502131429094]
本稿では,適応正規化ハードThresholding (ARHT) アルゴリズムを提案する。また、OMP with Replacement (OMPR) for general $f$, under the condition $s > s* frackappa24$。
論文参考訳（メタデータ） (2020-06-25T17:16:21Z)
$\lambda$-Regularized A-Optimal Design and its Approximation by $\lambda$-Regularized Proportional Volume Sampling [1.256413718364189]
本稿では,$lambda$-regularized $A$-optimal design problemについて検討し,$lambda$-regularized proportional volume sample algorithmを紹介する。この問題は、リッジ回帰モデルにおける真の係数からリッジ回帰予測器の2乗誤差を最小化しようとする、リッジ回帰の最適設計から動機づけられている。
論文参考訳（メタデータ） (2020-06-19T15:17:57Z)
Maximizing Determinants under Matroid Constraints [69.25768526213689]
我々は、$det(sum_i in Sv_i v_i v_itop)$が最大になるような基底を$S$$$$M$とする問題を研究する。この問題は、実験的なデザイン、商品の公平な割り当て、ネットワーク設計、機械学習など、さまざまな分野に現れている。
論文参考訳（メタデータ） (2020-04-16T19:16:38Z)
Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
本稿では,決定論的システムにおける関数近似を用いたQ$学習の問題について検討する。もし$delta = Oleft(rho/sqrtdim_Eright)$なら、$Oleft(dim_Eright)$を使って最適なポリシーを見つけることができる。
論文参考訳（メタデータ） (2020-02-17T18:41:49Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。