Adversarially Robust Multi-Armed Bandit Algorithm with
Variance-Dependent Regret Bounds
- URL: http://arxiv.org/abs/2206.06810v1
- Date: Tue, 14 Jun 2022 12:58:46 GMT
- Title: Adversarially Robust Multi-Armed Bandit Algorithm with
Variance-Dependent Regret Bounds
- Authors: Shinji Ito, Taira Tsuchiya, Junya Honda
- Abstract summary: This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings.
- Score: 34.37963000493442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers the multi-armed bandit (MAB) problem and provides a new
best-of-both-worlds (BOBW) algorithm that works nearly optimally in both
stochastic and adversarial settings. In stochastic settings, some existing BOBW
algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: \Delta_i>0}
\frac{\log T}{\Delta_i})$ for suboptimality gap $\Delta_i$ of arm $i$ and time
horizon $T$. As Audibert et al. [2007] have shown, however, the performance
can be improved in stochastic environments with low-variance arms.
In fact, they have provided a stochastic MAB algorithm with
gap-variance-dependent regret bounds of $O(\sum_{i: \Delta_i>0}
(\frac{\sigma_i^2}{\Delta_i} + 1) \log T )$ for loss variance $\sigma_i^2$ of
arm $i$. In this paper, we propose the first BOBW algorithm with
gap-variance-dependent bounds, showing that the variance information can be
used even in possibly adversarial environments. Further, the leading constant
factor in our gap-variance-dependent bound is only (almost) twice that of the
lower bound. Additionally, the proposed algorithm enjoys multiple
data-dependent regret bounds in adversarial settings and works well in
stochastic settings with adversarial corruptions. The proposed algorithm is
based on the follow-the-regularized-leader method and employs adaptive learning
rates that depend on the empirical prediction error of the loss, which leads to
gap-variance-dependent regret bounds reflecting the variance of the arms.
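The abstract describes the approach only at a high level: follow-the-regularized-leader (FTRL) with learning rates adapted to the empirical prediction error of the losses. The sketch below is a minimal, illustrative Python rendering of that general idea, not the authors' algorithm; the exponential-weights (entropy-regularized FTRL) sampling rule, the importance-weighted loss estimates, the running per-arm prediction `m_hat`, and the specific learning-rate schedule are all assumptions made for illustration.

```python
import numpy as np

def ftrl_bandit_sketch(loss_matrix, seed=0):
    """Illustrative FTRL-style MAB loop (not the paper's exact algorithm).

    Plays exponential weights over importance-weighted loss estimates, with a
    learning rate that shrinks as the cumulative squared prediction error of
    the observed losses grows -- loosely mirroring the adaptive-learning-rate
    idea described in the abstract.

    loss_matrix: (T, K) array of losses in [0, 1], one row per round.
    """
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    L_hat = np.zeros(K)      # cumulative importance-weighted loss estimates
    m_hat = np.zeros(K)      # running per-arm loss predictions
    pred_err = 1.0           # cumulative squared prediction error (init > 0)
    total_loss = 0.0
    for t in range(T):
        eta = np.sqrt(np.log(K) / pred_err)       # adaptive learning rate
        w = np.exp(-eta * (L_hat - L_hat.min()))  # shift for numerical stability
        p = w / w.sum()                           # FTRL with entropy regularizer
        arm = rng.choice(K, p=p)
        loss = float(loss_matrix[t, arm])
        total_loss += loss
        ell_hat = np.zeros(K)                     # importance-weighted estimate
        ell_hat[arm] = loss / p[arm]
        L_hat += ell_hat
        pred_err += (loss - m_hat[arm]) ** 2      # accumulate prediction error
        m_hat[arm] += 0.1 * (loss - m_hat[arm])   # update the loss prediction
    return total_loss
```

For a rough sense of scale of the two bounds quoted above (using natural logarithms): with $T = 10^5$, $\Delta_i = 0.1$, and $\sigma_i^2 = 0.01$, the gap-dependent term $\frac{\log T}{\Delta_i} \approx 115$, while the gap-variance-dependent term $(\frac{\sigma_i^2}{\Delta_i} + 1)\log T \approx 12.7$, which illustrates the kind of improvement low-variance arms can yield.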
Related papers
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Towards Efficient and Optimal Covariance-Adaptive Algorithms for Combinatorial Semi-Bandits [12.674929126684528]
We address the problem of semi-bandits, where a player selects among $P$ actions from the power set of a set containing $d$ base items.
We show that our approach efficiently leverages the semi-bandit feedback and outperforms bandit feedback approaches.
arXiv Detail & Related papers (2024-02-23T08:07:54Z)
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound of $\tilde{O}\big(d\sqrt{\sum_{t=1}^T \sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- Contextual Combinatorial Bandits with Probabilistically Triggered Arms [45.305256056479045]
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions.
Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and derive a regret bound of $\tilde{O}(d\sqrt{T})$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z)
- Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds [27.92755687977196]
The paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy.
At the higher level, the proposed algorithm adapts to a variety of types of environments.
The proposed algorithm has data-dependent regret bounds that depend on the cumulative loss of the optimal action.
arXiv Detail & Related papers (2023-02-24T00:02:24Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^K \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments [48.96971760679639]
We study variance-dependent regret bounds for Markov decision processes (MDPs)
We propose two new environment norms to characterize the fine-grained variance properties of the environment.
For model-based methods, we design a variant of the MVP algorithm.
In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs.
arXiv Detail & Related papers (2023-01-31T06:54:06Z)
- Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z)