Improved Regret Analysis for Variance-Adaptive Linear Bandits and
Horizon-Free Linear Mixture MDPs
- URL: http://arxiv.org/abs/2111.03289v1
- Date: Fri, 5 Nov 2021 06:47:27 GMT
- Title: Improved Regret Analysis for Variance-Adaptive Linear Bandits and
Horizon-Free Linear Mixture MDPs
- Authors: Yeoneung Kim, Insoon Yang, Kwang-Sung Jun
- Abstract summary: In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees.
We present novel analyses that significantly improve the regret bounds of Zhang et al. (2021).
Our analysis critically relies on a novel elliptical potential `count' lemma.
- Score: 12.450760567361531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In online learning problems, exploiting low variance plays an important role
in obtaining tight performance guarantees yet is challenging because variances
are often not known a priori. Recently, considerable progress has been made
by Zhang et al. (2021), who obtain a variance-adaptive regret bound for
linear bandits without knowledge of the variances and a horizon-free regret
bound for linear mixture Markov decision processes (MDPs). In this paper, we
present novel analyses that improve their regret bounds significantly. For
linear bandits, we achieve $\tilde O(d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2} +
d^2)$, where $d$ is the dimension of the features, $K$ is the time horizon,
$\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores
polylogarithmic dependence; this is a factor of $d^3$ improvement. For linear
mixture MDPs, we achieve a horizon-free regret bound of $\tilde
O(d^{1.5}\sqrt{K} + d^3)$ where $d$ is the number of base models and $K$ is the
number of episodes. This is a factor of $d^3$ improvement in the leading term
and $d^6$ in the lower order term. Our analysis critically relies on a novel
elliptical potential `count' lemma. This lemma allows a peeling-based regret
analysis, which can be of independent interest.
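For intuition, here is an informal sketch of the classical elliptical potential lemma and of what a `count' variant adds; the statement below is a paraphrase under assumed bounds ($\|x_k\|_2 \le L$, regularization $\lambda > 0$), not the paper's exact lemma or constants. Writing $V_{k-1} = \lambda I + \sum_{i<k} x_i x_i^\top$, the classical lemma controls the total potential,
$$\sum_{k=1}^K \min\bigl\{1, \|x_k\|_{V_{k-1}^{-1}}^2\bigr\} \le 2d \log\Bigl(1 + \frac{K L^2}{d\lambda}\Bigr),$$
which still grows with $K$. A `count' version instead bounds how many rounds can have large potential: the number of indices $k$ with $\|x_k\|_{V_{k-1}^{-1}}^2 \ge 1$ is $O(d \log(1 + L^2/\lambda))$, independent of $K$, since (roughly) each such round at least doubles the determinant of the regularized Gram matrix formed by those rounds alone. Bounding the number of "large" rounds separately at each scale is what enables a peeling-based regret analysis.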
Related papers
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^K \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z)
- Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z)
- Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least squares estimator for the unknown transition dynamics.
This also improves upon the best-known algorithms in this setting when $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z)
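The entry above mentions a variance-weighted least squares estimator. As a rough illustration of the general idea only (not that paper's algorithm: the weighting rule, the variance floor, and the assumption that per-round variances are observed are all simplifications made here), a minimal NumPy sketch:

import numpy as np

def weighted_ridge(X, y, sigma2, lam=1.0, var_floor=1e-2):
    """Variance-weighted ridge regression (illustrative, not the paper's estimator).

    X         : (K, d) feature matrix, one row per round
    y         : (K,) observed targets
    sigma2    : (K,) per-round noise variances (assumed known here for simplicity)
    lam       : ridge regularizer
    var_floor : lower bound on the variance so the weights stay finite
    """
    K, d = X.shape
    w = 1.0 / np.maximum(sigma2, var_floor)           # down-weight noisy rounds
    Sigma = lam * np.eye(d) + (X * w[:, None]).T @ X  # lam*I + sum_k w_k x_k x_k^T
    b = (X * (w * y)[:, None]).sum(axis=0)            # sum_k w_k y_k x_k
    theta_hat = np.linalg.solve(Sigma, b)
    return theta_hat, Sigma

# Tiny synthetic check with heteroscedastic noise.
rng = np.random.default_rng(0)
d, K = 5, 400
theta_star = rng.normal(size=d)
X = rng.normal(size=(K, d))
sigma2 = rng.uniform(0.01, 1.0, size=K)
y = X @ theta_star + np.sqrt(sigma2) * rng.normal(size=K)
theta_hat, _ = weighted_ridge(X, y, sigma2)
print("estimation error:", np.linalg.norm(theta_hat - theta_star))

In the actual algorithms the variances are not observed, so the weights are built from variance estimates and exploration bonuses rather than the true $\sigma_k^2$'s.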
- Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^K \sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z)
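For context on what a variance-aware confidence set typically looks like in this line of work (a generic shape, not that paper's exact construction): with variance-dependent weights $w_k \approx 1/\sigma_k^2$, one maintains the ellipsoid
$$\mathcal{C}_K = \bigl\{\theta : \|\theta - \hat\theta_K\|_{\hat\Sigma_K} \le \beta_K\bigr\}, \qquad \hat\Sigma_K = \lambda I + \sum_{k=1}^K w_k x_k x_k^\top,$$
where $\hat\theta_K$ is the weighted least squares estimate and the radius $\beta_K$ scales as $\tilde O(\mathrm{poly}(d))$ with no $\sqrt{\sum_k \sigma_k^2}$ factor. Roughly, the variance then enters the regret only through the widths $\sigma_k \|x_k/\sigma_k\|_{\hat\Sigma_{k-1}^{-1}}$, which sum to $\tilde O(\mathrm{poly}(d)\sqrt{\sum_k \sigma_k^2})$ via Cauchy-Schwarz and the elliptical potential lemma.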
- On Frequentist Regret of Linear Thompson Sampling [8.071506311915398]
This paper studies the linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in $\mathbb{R}^d$ and receives noisy rewards.
The objective is to minimize regret, the difference between the cumulative expected reward of an oracle with access to the expected reward of each action and that of the decision-maker, over a sequence of $T$ decisions.
arXiv Detail & Related papers (2020-06-11T20:19:41Z)