Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP
- URL: http://arxiv.org/abs/2101.12745v2
- Date: Fri, 19 Feb 2021 15:46:46 GMT
- Title: Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP
- Authors: Zihan Zhang, Jiaqi Yang, Xiangyang Ji, Simon S. Du
- Abstract summary: We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d, \log H)\sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the number of episodes, and $H$ is the planning horizon.
- Score: 76.94328400919836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show how to construct variance-aware confidence sets for linear bandits
and linear mixture Markov Decision Process (MDP). Our method yields the
following new regret bounds:
* For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 +
\sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension,
$K$ is the number of rounds, and $\sigma_i^2$ is the (unknown) variance of the
reward at the $i$-th round. This is the first regret bound that only scales
with the variance and the dimension, with no explicit polynomial dependency on
$K$.
* For linear mixture MDP, we obtain an $\widetilde{O}(\mathrm{poly}(d, \log
H)\sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the
number of episodes, and $H$ is the planning horizon. This is the first regret
bound that only scales logarithmically with $H$ in the reinforcement learning
with linear function approximation setting, thus exponentially improving
existing results.
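As a rough illustration of where a $\sqrt{\sum_{i}\sigma_i^2}$ dependence like the one above can come from, the sketch below shows variance-weighted ridge regression, the basic mechanism behind variance-aware confidence sets. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function names and the direct precision weighting are illustrative, and the paper's construction additionally peels over the input norm and the variance magnitude and estimates unknown variances recursively.

```python
import numpy as np

def weighted_ridge(X, y, sigma2, lam=1.0):
    """Variance-weighted ridge regression over K observed rounds.

    X      : (K, d) feature matrix, one action vector per round
    y      : (K,)   observed rewards
    sigma2 : (K,)   per-round reward variances (known or estimated)
    lam    : ridge regularizer

    Weighting each round by its precision 1/sigma_i^2 lets low-variance
    rounds tighten the estimate more, so the induced confidence ellipsoid
    can scale with sum_i sigma_i^2 instead of the round count K.
    """
    d = X.shape[1]
    w = 1.0 / np.maximum(sigma2, 1e-8)            # precision weights
    A = lam * np.eye(d) + (w[:, None] * X).T @ X  # weighted Gram matrix
    b = (w[:, None] * X).T @ y
    return np.linalg.solve(A, b), A               # theta_hat, Gram matrix

def ucb_bonus(x, A, beta):
    """Optimistic bonus beta * ||x||_{A^{-1}} for a candidate action x."""
    return beta * np.sqrt(x @ np.linalg.solve(A, x))
```

A UCB-style learner would then pick the action maximizing `x @ theta_hat + ucb_bonus(x, A, beta)`, with `beta` set by the radius of the confidence set.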
Our methods utilize three novel ideas that may be of independent interest: 1)
applications of the peeling techniques to the norm of input and the magnitude
of variance, 2) a recursion-based approach to estimate the variance, and 3) a
convex potential lemma that somewhat generalizes the seminal elliptical
potential lemma.
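For context, the seminal elliptical potential lemma that the convex potential lemma generalizes is standardly stated as follows, for actions $x_1, \dots, x_K \in \mathbb{R}^d$ with $\|x_i\|_2 \le L$ and Gram matrices $A_k = \lambda I + \sum_{i=1}^{k} x_i x_i^\top$:

$$\sum_{k=1}^{K} \min\bigl\{1,\ \|x_k\|_{A_{k-1}^{-1}}^{2}\bigr\} \;\le\; 2\log\frac{\det A_K}{\det A_0} \;\le\; 2d \log\Bigl(1 + \frac{K L^{2}}{\lambda d}\Bigr)$$

Intuitively, the left-hand side bounds how often new actions can point in directions the learner has not yet explored; the paper's convex variant extends this control to more general potentials.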
Related papers
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Variance-aware robust reinforcement learning with linear function
approximation with heavy-tailed rewards [6.932056534450556]
We present two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards.
AdaOFUL achieves a state-of-the-art variance-aware regret bound.
VARA achieves a tighter variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$.
arXiv Detail & Related papers (2023-03-09T22:16:28Z)
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z)
- Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z)
- Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least-squares estimator for the unknown transition dynamics.
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z)
- On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
arXiv Detail & Related papers (2021-12-03T21:42:33Z)
- Improved Regret Analysis for Variance-Adaptive Linear Bandits and
Horizon-Free Linear Mixture MDPs [12.450760567361531]
In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees.
We present novel analyses that improve their regret bounds significantly.
Our analysis critically relies on a novel 'elliptical potential count' lemma.
arXiv Detail & Related papers (2021-11-05T06:47:27Z)