Fast UCB-type algorithms for stochastic bandits with heavy and super
heavy symmetric noise
- URL: http://arxiv.org/abs/2402.07062v1
- Date: Sat, 10 Feb 2024 22:38:21 GMT
- Title: Fast UCB-type algorithms for stochastic bandits with heavy and super
heavy symmetric noise
- Authors: Yuriy Dorn, Aleksandr Katrutsa, Ilgam Latypov, Andrey Pudovikov
- Abstract summary: We propose a new method for constructing UCB-type algorithms for multi-armed bandits.
We show that in the case of symmetric noise in the reward, we can achieve an $O(\log T\sqrt{KT\log T})$ regret bound instead of $O\left(T^{\frac{1}{1+\alpha}} K^{\frac{\alpha}{1+\alpha}}\right)$.
- Score: 45.60098988395789
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this study, we propose a new method for constructing UCB-type algorithms
for stochastic multi-armed bandits based on general convex optimization methods
with an inexact oracle. We derive the regret bounds corresponding to the
convergence rates of the optimization methods. We propose a new algorithm
Clipped-SGD-UCB and show, both theoretically and empirically, that in the case
of symmetric noise in the reward, we can achieve an $O(\log T\sqrt{KT\log T})$
regret bound instead of $O\left (T^{\frac{1}{1+\alpha}}
K^{\frac{\alpha}{1+\alpha}} \right)$ for the case when the reward distribution
satisfies $\mathbb{E}_{X \in D}[|X|^{1+\alpha}] \leq \sigma^{1+\alpha}$
($\alpha \in (0, 1]$), i.e., performing better than the general lower bound for
bandits with heavy tails suggests. Moreover, the same bound holds even when the
reward distribution does not have an expectation, that is, when $\alpha<0$.
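The abstract describes Clipped-SGD-UCB only at a high level. The sketch below, in Python, illustrates the idea under stated assumptions: each arm's mean is tracked by clipped SGD on the quadratic loss $\frac{1}{2}\mathbb{E}[(\theta - X)^2]$, whose stochastic gradient $(\theta - X)$ is clipped so the update stays stable under heavy or even mean-less symmetric noise, and arms are chosen by a UCB index. The class name, clipping level, $1/n$ step size, and the $\sqrt{\log t / n}$ exploration bonus are illustrative assumptions, not the exact constants derived in the paper.

```python
import numpy as np


def clip(v, lam):
    """Clip a scalar to the interval [-lam, lam]."""
    return max(-lam, min(lam, v))


class ClippedSGDUCB:
    """Minimal sketch of a Clipped-SGD-UCB-style bandit (illustrative, not the paper's exact algorithm).

    Each arm keeps a mean estimate produced by clipped SGD on the quadratic
    loss f(theta) = E[(theta - X)^2] / 2; clipping the stochastic gradient
    (theta - X) keeps the estimate stable under heavy and super-heavy
    symmetric reward noise.  The exploration bonus is a generic
    sqrt(log t / n) radius, assumed here for illustration.
    """

    def __init__(self, n_arms, clip_level=1.0, bonus_scale=1.0):
        self.n_arms = n_arms
        self.clip_level = clip_level            # gradient clipping threshold (assumed constant)
        self.bonus_scale = bonus_scale          # scale of the exploration bonus (assumed)
        self.theta = np.zeros(n_arms)           # clipped-SGD mean estimates
        self.pulls = np.zeros(n_arms, dtype=int)

    def select_arm(self, t):
        # Play every arm once before trusting the estimates.
        untried = np.where(self.pulls == 0)[0]
        if untried.size > 0:
            return int(untried[0])
        bonus = self.bonus_scale * np.sqrt(np.log(t + 1) / self.pulls)
        return int(np.argmax(self.theta + bonus))

    def update(self, arm, reward):
        self.pulls[arm] += 1
        step = 1.0 / self.pulls[arm]            # 1/n step size (assumed)
        grad = self.theta[arm] - reward         # gradient of the quadratic loss
        self.theta[arm] -= step * clip(grad, self.clip_level)


# Toy run with symmetric Cauchy noise, i.e. rewards with no expectation.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_means = np.array([0.0, 0.3, 0.6])
    agent = ClippedSGDUCB(n_arms=3)
    for t in range(5000):
        arm = agent.select_arm(t)
        reward = true_means[arm] + rng.standard_cauchy()
        agent.update(arm, reward)
    print("pull counts:", agent.pulls)
    print("estimates  :", np.round(agent.theta, 3))
```

Because the Cauchy noise is symmetric about the true mean, the clipped gradient has zero mean at the true mean even though the reward has no expectation, which is the regime the abstract refers to as $\alpha < 0$.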
Related papers
- Nearly Optimal Sample Complexity of Offline KL-Regularized Contextual Bandits under Single-Policy Concentrability [49.96531901205305]
We propose the \emph{first} algorithm with $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for offline contextual bandits.
Our proof leverages the strong convexity of the KL regularization, and the conditional non-negativity of the gap between the true reward and its pessimistic estimator.
We extend our algorithm to contextual dueling bandits and achieve a similar nearly optimal sample complexity.
arXiv Detail & Related papers (2025-02-09T22:14:45Z) - Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping [21.865728815935665]
We provide the first convergence guarantees under heavy-tailed noises without gradient clipping.
We also establish the first $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ convergence rate in the case where the tail index $\mathfrak{p}$ is unknown in advance.
arXiv Detail & Related papers (2024-12-27T08:46:46Z) - p-Mean Regret for Stochastic Bandits [52.828710025519996]
We introduce a simple, unified UCB-based algorithm that achieves novel $p$-mean regret bounds.
Our framework encompasses both average cumulative regret and Nash regret as special cases.
arXiv Detail & Related papers (2024-12-14T08:38:26Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel Stochastic-Greedy Bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Generalized Regret Analysis of Thompson Sampling using Fractional
Posteriors [12.43000662545423]
Thompson sampling (TS) is one of the most popular and earliest algorithms to solve multi-armed bandit problems.
We consider a variant of TS, named $\alpha$-TS, where we use a fractional or $\alpha$-posterior instead of the standard posterior distribution.
arXiv Detail & Related papers (2023-09-12T16:15:33Z) - Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed
Bandits [22.18577284185939]
We develop robust best-of-both-worlds algorithms for heavy-tailed multi-armed bandits.
\texttt{HTINF} simultaneously achieves the optimal regret for both stochastic and adversarial environments.
\texttt{AdaTINF} is the first algorithm that can adapt to both $\alpha$ and $\sigma$ to achieve the optimal gap-independent regret bound.
arXiv Detail & Related papers (2022-01-28T03:53:39Z) - Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z) - Smooth Bandit Optimization: Generalization to H\"older Space [37.15553727896912]
We consider bandit optimization of a smooth reward function, where the goal is cumulative regret.
Our main result is a generalization to reward functions in the H\"older space with exponent $\alpha > 1$.
We show that it achieves a regret rate that matches the existing lower bound for adaptation within the $\alpha \leq 1$ subset.
arXiv Detail & Related papers (2020-12-11T01:43:25Z) - Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
arXiv Detail & Related papers (2020-10-22T19:14:37Z)