Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency
- URL: http://arxiv.org/abs/2302.10371v1
- Date: Tue, 21 Feb 2023 00:17:24 GMT
- Title: Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency
- Authors: Heyang Zhao and Jiafan He and Dongruo Zhou and Tong Zhang and Quanquan
Gu
- Abstract summary: We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d \sqrt{\sum_{k=1}^K \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
- Score: 90.40062452292091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et
al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds
for linear contextual bandits, which interpolate between the regret in the worst-case
regime and that in the deterministic reward regime. However, these algorithms are
either computationally intractable or unable to handle unknown variance of the
noise. In this paper, we present a novel solution to this open problem by
proposing the first computationally efficient algorithm for linear bandits with
heteroscedastic noise. Our algorithm is adaptive to the unknown variance of
noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$
regret, where $\sigma_k^2$ is the variance of the noise at round $k$, $d$
is the dimension of the contexts and $K$ is the total number of rounds. Our
results are based on an adaptive variance-aware confidence set enabled by a new
Freedman-type concentration inequality for self-normalized martingales and a
multi-layer structure to stratify the context vectors into different layers
with different uniform upper bounds on the uncertainty.
Furthermore, our approach can be extended to linear mixture Markov decision
processes (MDPs) in reinforcement learning. We propose a variance-adaptive
algorithm for linear mixture MDPs, which achieves a problem-dependent
horizon-free regret bound that can gracefully reduce to a nearly constant
regret for deterministic MDPs. Unlike existing nearly minimax optimal
algorithms for linear mixture MDPs, our algorithm does not require explicit
variance estimation of the transition probabilities or the use of high-order
moment estimators to attain horizon-free regret. We believe the techniques
developed in this paper can have independent value for general online decision
making problems.
Related papers
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the total variation budget $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Towards Efficient and Optimal Covariance-Adaptive Algorithms for Combinatorial Semi-Bandits [12.674929126684528]
We address the problem of combinatorial semi-bandits, where a player selects among $P$ actions from the power set of a set containing $d$ base items.
We show that our approach efficiently leverages the semi-bandit feedback and outperforms bandit feedback approaches.
arXiv Detail & Related papers (2024-02-23T08:07:54Z) - Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^T \sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive
Regret Bounds [27.92755687977196]
The paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy.
At the higher level, the proposed algorithm adapts to a variety of types of environments.
The proposed algorithm has data-dependent regret bounds that depend on the cumulative loss of the optimal action.
arXiv Detail & Related papers (2023-02-24T00:02:24Z) - Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both
Worlds in Stochastic and Deterministic Environments [48.96971760679639]
We study variance-dependent regret bounds for Markov decision processes (MDPs).
We propose two new environment norms to characterize the fine-grained variance properties of the environment.
For model-based methods, we design a variant of the MVP algorithm.
In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs.
arXiv Detail & Related papers (2023-01-31T06:54:06Z) - Adversarially Robust Multi-Armed Bandit Algorithm with
Variance-Dependent Regret Bounds [34.37963000493442]
This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings.
arXiv Detail & Related papers (2022-06-14T12:58:46Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least-squares estimator for the unknown transition dynamics.
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and the number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)