Heavy-tailed Linear Bandits: Adversarial Robustness, Best-of-both-worlds, and Beyond
- URL: http://arxiv.org/abs/2508.13679v1
- Date: Tue, 19 Aug 2025 09:28:51 GMT
- Title: Heavy-tailed Linear Bandits: Adversarial Robustness, Best-of-both-worlds, and Beyond
- Authors: Canzhe Zhao, Shinji Ito, Shuai Li,
- Abstract summary: We propose a general framework for adversarial heavy-tailed bandit problems.<n>We devise the first FTRL-type best-of-both-worlds (BOBW) algorithm for heavy-tailed MABs.<n>We also propose a general data-dependent learning rate, termed textitheavy-tailed noise aware stability-penalty matching (HT-SPM)
- Score: 36.287082425317934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Heavy-tailed bandits have been extensively studied since the seminal work of \citet{Bubeck2012BanditsWH}. In particular, heavy-tailed linear bandits, enabling efficient learning with both a large number of arms and heavy-tailed noises, have recently attracted significant attention \citep{ShaoYKL18,XueWWZ20,ZhongHYW21,Wang2025heavy,tajdini2025improved}. However, prior studies focus almost exclusively on stochastic regimes, with few exceptions limited to the special case of heavy-tailed multi-armed bandits (MABs) \citep{Huang0H22,ChengZ024,Chen2024uniINF}. In this work, we propose a general framework for adversarial heavy-tailed bandit problems, which performs follow-the-regularized-leader (FTRL) over the loss estimates shifted by a bonus function. Via a delicate setup of the bonus function, we devise the first FTRL-type best-of-both-worlds (BOBW) algorithm for heavy-tailed MABs, which does not require the truncated non-negativity assumption and achieves an $\widetilde{O}(T^{\frac{1}{\varepsilon}})$ worst-case regret in the adversarial regime as well as an $\widetilde{O}(\log T)$ gap-dependent regret in the stochastic regime. We then extend our framework to the linear case, proposing the first algorithm for adversarial heavy-tailed linear bandits with finite arm sets. This algorithm achieves an $\widetilde{O}(d^{\frac{1}{2}}T^{\frac{1}{\varepsilon}})$ regret, matching the best-known worst-case regret bound in stochastic regimes. Moreover, we propose a general data-dependent learning rate, termed \textit{heavy-tailed noise aware stability-penalty matching} (HT-SPM). We prove that HT-SPM guarantees BOBW regret bounds for general heavy-tailed bandit problems once certain conditions are satisfied. By using HT-SPM and, in particular, a variance-reduced linear loss estimator, we obtain the first BOBW result for heavy-tailed linear bandits.
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences.<n>We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices.<n>We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z) - Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback [61.49239204705301]
We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model.<n>Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems.
arXiv Detail & Related papers (2025-10-20T02:28:08Z) - Best-of-Both Worlds for linear contextual bandits with paid observations [16.13456643813766]
We introduce a computationally efficient Best-of-Both-Worlds (BOBW) algorithm for this problem.<n>We show that it achieves the minimax-optimal regret of $Theta(T2/3)$ in adversarial settings, while guaranteeing poly-logarithmic regret in (corrupted) regimes.
arXiv Detail & Related papers (2025-10-08T18:23:37Z) - Influential Bandits: Pulling an Arm May Change the Environment [44.71145269686588]
Real-world applications often involve non-stationary environments and interdependencies between arms.<n>We propose the influential bandit problem, which models inter-arm interactions through an unknown, symmetric, positive semi-definite interaction matrix.<n>We introduce a new algorithm based on a lower confidence bound (LCB) estimator tailored to the structure of the loss dynamics.
arXiv Detail & Related papers (2025-04-11T02:05:51Z) - Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback.<n>This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc.<n>Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z) - Implicitly normalized forecaster with clipping for linear and non-linear
heavy-tailed multi-armed bandits [85.27420062094086]
Implicitly Normalized Forecaster (INF) is considered an optimal solution for adversarial multi-armed bandit (MAB) problems.
We propose a new version of INF called the Implicitly Normalized Forecaster with clipping (INFclip) for MAB problems with heavy-tailed settings.
We demonstrate that INFclip is optimal for linear heavy-tailed MAB problems and works well for non-linear ones.
arXiv Detail & Related papers (2023-05-11T12:00:43Z) - On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z) - Optimal Online Generalized Linear Regression with Stochastic Noise and
Its Application to Heteroscedastic Bandits [88.6139446295537]
We study the problem of online generalized linear regression in the setting of a generalized linear model with possibly unbounded additive noise.
We provide a sharp analysis of the classical follow-the-regularized-leader (FTRL) algorithm to cope with the label noise.
We propose an algorithm based on FTRL to achieve the first variance-aware regret bound.
arXiv Detail & Related papers (2022-02-28T08:25:26Z) - Versatile Dueling Bandits: Best-of-both-World Analyses for Online
Learning from Preferences [28.79598714109439]
We study the problem of $K$-armed dueling bandit for both and adversarial environments.
We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits.
Our algorithm is also the first to achieve an optimal $O(sum_i = 1K fraclog TDelta_i)$ regret bound against the Condorcet-winner benchmark.
arXiv Detail & Related papers (2022-02-14T13:37:23Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB)
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms [10.662105162882526]
We investigate a $k$-armed bandit problem in the emphmany-armed regime.
Our findings suggest a new form of emphfree exploration beneficial to greedy algorithms in the many-armed context.
arXiv Detail & Related papers (2020-02-24T08:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.