On the Interplay Between Misspecification and Sub-optimality Gap in
Linear Contextual Bandits
- URL: http://arxiv.org/abs/2303.09390v1
- Date: Thu, 16 Mar 2023 15:24:29 GMT
- Title: On the Interplay Between Misspecification and Sub-optimality Gap in
Linear Contextual Bandits
- Authors: Weitong Zhang and Jiafan He and Zhiyuan Fan and Quanquan Gu
- Abstract summary: We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class.
We show that our algorithm enjoys the same gap-dependent regret bound $\tilde{O}(d^2/\Delta)$ as in the well-specified setting up to logarithmic factors.
- Score: 76.2262680277608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study linear contextual bandits in the misspecified setting, where the
expected reward function can be approximated by a linear function class up to a
bounded misspecification level $\zeta>0$. We propose an algorithm based on a
novel data selection scheme, which only selects the contextual vectors with
large uncertainty for online regression. We show that, when the
misspecification level $\zeta$ is dominated by $\tilde O (\Delta / \sqrt{d})$
with $\Delta$ being the minimal sub-optimality gap and $d$ being the dimension
of the contextual vectors, our algorithm enjoys the same gap-dependent regret
bound $\tilde O (d^2/\Delta)$ as in the well-specified setting up to
logarithmic factors. In addition, we show that an existing algorithm SupLinUCB
(Chu et al., 2011) can also achieve a gap-dependent constant regret bound
without the knowledge of sub-optimality gap $\Delta$. Together with a lower
bound adapted from Lattimore et al. (2020), our result suggests an interplay
between misspecification level and the sub-optimality gap: (1) the linear
contextual bandit model is efficiently learnable when $\zeta \leq \tilde
O(\Delta / \sqrt{d})$; and (2) it is not efficiently learnable when $\zeta \geq
\tilde \Omega({\Delta} / {\sqrt{d}})$. Experiments on both synthetic and
real-world datasets corroborate our theoretical results.
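The core algorithmic idea above, running online regression only on contexts whose uncertainty is large, can be made concrete with a small sketch. The Python snippet below is an illustrative toy, not the paper's exact algorithm: the UCB-style action rule, the hyperparameters (ridge parameter lam, bonus scale beta, uncertainty threshold gamma), and the sinusoidal misspecification term are all assumptions made for the demo.

```python
# Minimal sketch (not the authors' exact algorithm) of uncertainty-gated
# data selection in a misspecified linear contextual bandit: the learner
# acts optimistically every round, but only feeds a context into the online
# ridge regression when its uncertainty ||x||_{A^{-1}} exceeds `gamma`.
import numpy as np

rng = np.random.default_rng(0)

d, K, T = 5, 10, 2000               # dimension, arms per round, horizon
lam, beta, gamma = 1.0, 1.0, 0.2    # illustrative hyperparameters

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
zeta = 0.05                         # misspecification level

A = lam * np.eye(d)                 # regularized Gram matrix
A_inv = np.linalg.inv(A)
b = np.zeros(d)
theta_hat = np.zeros(d)

regret = 0.0
for t in range(T):
    X = rng.normal(size=(K, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # true expected rewards: linear part plus a bounded (toy) misspecification
    means = X @ theta_star + zeta * np.sin(3.0 * X[:, 0])

    # UCB action selection with bonus beta * ||x||_{A^{-1}}
    widths = np.sqrt(np.einsum("ki,ij,kj->k", X, A_inv, X))
    a = int(np.argmax(X @ theta_hat + beta * widths))

    reward = means[a] + 0.1 * rng.normal()
    regret += means.max() - means[a]

    # data selection: update the regression only on high-uncertainty contexts
    if widths[a] > gamma:
        x = X[a]
        A += np.outer(x, x)
        b += reward * x
        A_inv = np.linalg.inv(A)    # fine for a sketch; use rank-1 updates in practice
        theta_hat = A_inv @ b

print(f"cumulative regret over {T} rounds: {regret:.1f}")
```

The gating step is what distinguishes this sketch from vanilla LinUCB: contexts that the current Gram matrix already explains well are never added to the regression. In the paper's analysis, a selection scheme of this flavor is what yields the constant $\tilde{O}(d^2/\Delta)$ regret when $\zeta \leq \tilde{O}(\Delta/\sqrt{d})$; the sketch only illustrates the mechanics, not the guarantee.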
Related papers
- Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error [25.777423855881878]
We give a novel elimination-based algorithm showing that one can obtain an $O\left(H\epsilon\right)$-optimal policy.
We complement our upper bound with an $\widetilde{\Omega}\left(H\epsilon\right)$-optimality lower bound, giving a complete picture of this problem.
arXiv Detail & Related papers (2024-07-18T15:58:04Z) - LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on Follow-The-Regularized-Leader with the Tsallis entropy and is referred to as $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
arXiv Detail & Related papers (2024-03-05T18:59:47Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Nearly Minimax Optimal Reinforcement Learning with Linear Function
Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel, computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z) - Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive
Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB)
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ such that $a$ is a non-unique optimal action for $s$.
arXiv Detail & Related papers (2021-02-09T07:46:34Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation, named $\text{UCRL-VTR}^{+}$, for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z) - Smooth Bandit Optimization: Generalization to H\"older Space [37.15553727896912]
We consider bandit optimization of a smooth reward function, where the goal is cumulative regret.
Our main result generalizes the reward function to the H\"older space with exponent $\alpha > 1$.
We show that our algorithm achieves a regret rate that matches the existing lower bound for adaptation within the $\alpha \leq 1$ subset.
arXiv Detail & Related papers (2020-12-11T01:43:25Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\mathrm{dim}_E}\right)$, then one can find the optimal policy using $O\left(\mathrm{dim}_E\right)$ trajectories.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)