Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error
- URL: http://arxiv.org/abs/2407.13622v1
- Date: Thu, 18 Jul 2024 15:58:04 GMT
- Title: Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error
- Authors: Ally Yalei Du, Lin F. Yang, Ruosong Wang,
- Abstract summary: We show a novel elimination-based algorithm to show one can obtain an $Oleft(Hepsilonright)$-optimal policy.
We complement our upper bound with an $widetildeOmegaleft(Hepsilonright)$-optimality lower bound, giving a complete picture of this problem.
- Score: 25.777423855881878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent work by Dong & Yang (2023) showed for misspecified sparse linear bandits, one can obtain an $O\left(\epsilon\right)$-optimal policy using a polynomial number of samples when the sparsity is a constant, where $\epsilon$ is the misspecification error. This result is in sharp contrast to misspecified linear bandits without sparsity, which require an exponential number of samples to get the same guarantee. In order to study whether the analog result is possible in the reinforcement learning setting, we consider the following problem: assuming the optimal $Q$-function is a $d$-dimensional linear function with sparsity $k$ and misspecification error $\epsilon$, whether we can obtain an $O\left(\epsilon\right)$-optimal policy using number of samples polynomially in the feature dimension $d$. We first demonstrate why the standard approach based on Bellman backup or the existing optimistic value function elimination approach such as OLIVE (Jiang et al., 2017) achieves suboptimal guarantees for this problem. We then design a novel elimination-based algorithm to show one can obtain an $O\left(H\epsilon\right)$-optimal policy with sample complexity polynomially in the feature dimension $d$ and planning horizon $H$. Lastly, we complement our upper bound with an $\widetilde{\Omega}\left(H\epsilon\right)$ suboptimality lower bound, giving a complete picture of this problem.
Related papers
- On the Interplay Between Misspecification and Sub-optimality Gap in
Linear Contextual Bandits [76.2262680277608]
We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class.
We show that our algorithm enjoys the same gap-dependent regret bound $tilde O (d2/Delta)$ as in the well-specified setting up to logarithmic factors.
arXiv Detail & Related papers (2023-03-16T15:24:29Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best identification in discounted linear Markov+Delta Decision in the fixed confidence setting under a generative model.
The lower bound as the solution of an intricate non- optimization program can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Infinite-Horizon Offline Reinforcement Learning with Linear Function
Approximation: Curse of Dimensionality and Algorithm [46.36534144138337]
In this paper, we investigate the sample complexity of policy evaluation in offline reinforcement learning.
Under the low distribution shift assumption, we show that there is an algorithm that needs at most $Oleft(maxleft fracleftVert thetapirightVert _24varepsilon4logfracddelta,frac1varepsilon2left(d+logfrac1deltaright)right right)$ samples to approximate the
arXiv Detail & Related papers (2021-03-17T18:18:57Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors the optimal excess population loss of any $(varepsilon,delta)$-differently private is $sqrtlog(d)/n + sqrtd/varepsilon n.$
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $sqrtlog(d)/n + (log(d)/varepsilon n)2/3.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = langle X,w* rangle + epsilon$
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $mathbbE [XXtop]$ has bounded condition number and $epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Sparse Convex Optimization via Adaptively Regularized Hard Thresholding [17.60502131429094]
We present a new Adaptively Regularized Hard Thresholding (ARHT) algorithm that makes significant progress on this problem.
We also provide a new analysis of OMP with Replacement (OMPR) for general $f$, under the condition $s > s* frackappa24$.
arXiv Detail & Related papers (2020-06-25T17:16:21Z) - $\lambda$-Regularized A-Optimal Design and its Approximation by
$\lambda$-Regularized Proportional Volume Sampling [1.256413718364189]
We study the $lambda$-regularized $A$-optimal design problem and introduce the $lambda$-regularized proportional volume sampling algorithm.
The problem is motivated from optimal design in ridge regression, where one tries to minimize the expected squared error of the ridge regression predictor from the true coefficient in the underlying linear model.
arXiv Detail & Related papers (2020-06-19T15:17:57Z) - Maximizing Determinants under Matroid Constraints [69.25768526213689]
We study the problem of finding a basis $S$ of $M$ such that $det(sum_i in Sv_i v_i v_itop)$ is maximized.
This problem appears in a diverse set of areas such as experimental design, fair allocation of goods, network design, and machine learning.
arXiv Detail & Related papers (2020-04-16T19:16:38Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $delta = Oleft(rho/sqrtdim_Eright)$, then one can find the optimal policy using $Oleft(dim_Eright)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.