Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays
- URL: http://arxiv.org/abs/2110.13400v1
- Date: Tue, 26 Oct 2021 04:06:51 GMT
- Title: Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays
- Authors: Jiatai Huang, Yan Dai, Longbo Huang
- Abstract summary: We consider the Scale-Free Adversarial Multi Armed Bandit (MAB) problem with unrestricted feedback delays.
\texttt{SFBanker} achieves $\mathcal{O}(\sqrt{K(D+T)}L)\cdot \mathrm{polylog}(T, L)$ total regret, where $T$ is the total number of steps and $D$ is the total feedback delay.
- Score: 21.94728545221709
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We consider the Scale-Free Adversarial Multi Armed Bandit (MAB) problem with
unrestricted feedback delays. In contrast to the standard assumption that all
losses are $[0,1]$-bounded, in our setting, losses can fall in a general
bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the
feedback of each arm pull can experience arbitrary delays. We propose an
algorithm named \texttt{SFBanker} for this novel setting, which combines a
recent banker online mirror descent technique and elaborately designed doubling
tricks. We show that \texttt{SFBanker} achieves $\mathcal
O(\sqrt{K(D+T)}L)\cdot {\rm polylog}(T, L)$ total regret, where $T$ is the
total number of steps and $D$ is the total feedback delay. \texttt{SFBanker}
also outperforms existing algorithms for non-delayed (i.e., $D=0$) scale-free
adversarial MAB problem instances. We also present a variant of
\texttt{SFBanker} for problem instances with non-negative losses (i.e., they
range in $[0, L]$ for some unknown $L$), achieving an $\tilde{\mathcal
O}(\sqrt{K(D+T)}L)$ total regret, which is near-optimal compared to the
$\Omega(\sqrt{KT}+\sqrt{D\log K}L)$ lower-bound ([Cesa-Bianchi et al., 2016]).
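The abstract credits SFBanker's scale-freeness to doubling tricks layered on banker online mirror descent. As an illustration of the doubling-trick ingredient only (not the paper's actual construction; all names below are hypothetical), a guess for the unknown scale $L$ can be doubled whenever an observed loss overflows it, signalling the base learner to restart:

```python
class DoublingScale:
    """Track a guess for the unknown loss range [-L, L] via doubling.

    Hypothetical sketch: whenever an observed loss exceeds the current
    guess, the guess is doubled (repeatedly) and the base learner is
    told to restart with the new scale.
    """

    def __init__(self, initial_scale=1.0):
        self.scale = initial_scale
        self.restarts = 0  # how many times the base learner was reset

    def observe(self, loss):
        """Return True if the scale changed (base learner must restart)."""
        changed = False
        while abs(loss) > self.scale:
            self.scale *= 2.0
            changed = True
        if changed:
            self.restarts += 1
        return changed


guard = DoublingScale()
guard.observe(0.7)               # within [-1, 1]: no restart
restarted = guard.observe(5.0)   # doubles 1 -> 2 -> 4 -> 8, one restart
```

Since the scale at most doubles past the largest observed loss, the number of restarts is logarithmic in $L$, which is where the $\mathrm{polylog}(T, L)$ factors in such bounds typically come from.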
Related papers
- Batched Stochastic Bandit for Nondegenerate Functions [8.015503209312786]
This paper studies batched bandit learning problems for nondegenerate functions.
We introduce an algorithm that solves the batched bandit problem for nondegenerate functions near-optimally.
arXiv Detail & Related papers (2024-05-09T12:50:16Z)
- LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on Follow-The-Regularized-Leader with the Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
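At each round, FTRL with the $1/2$-Tsallis entropy reduces to solving a one-dimensional normalization equation for the arm distribution. A generic (non-contextual) Tsallis-INF sampling step, sketched here for illustration and not the paper's LC variant, can be computed by bisection:

```python
import math


def tsallis_inf_probs(cum_losses, eta, iters=60):
    """1/2-Tsallis-INF arm distribution via bisection.

    Solves sum_i 4 / (eta * (L_i - x))^2 = 1 for the normalizer x,
    the standard FTRL-with-Tsallis-entropy (alpha = 1/2) step, where
    L_i are cumulative (estimated) losses and eta is the learning rate.
    """
    k = len(cum_losses)
    m = min(cum_losses)
    lo = m - 2.0 * math.sqrt(k) / eta  # total mass <= 1 here
    hi = m - 2.0 / eta                 # total mass >= 1 here
    for _ in range(iters):
        x = 0.5 * (lo + hi)
        mass = sum(4.0 / (eta * (l - x)) ** 2 for l in cum_losses)
        if mass > 1.0:
            hi = x
        else:
            lo = x
    x = 0.5 * (lo + hi)
    return [4.0 / (eta * (l - x)) ** 2 for l in cum_losses]


p = tsallis_inf_probs([0.0, 0.0, 0.0, 0.0], eta=0.5)  # uniform: 1/4 each
q = tsallis_inf_probs([0.0, 1.0, 4.0], eta=1.0)       # favors lower loss
```

The bracket endpoints are chosen so the total mass is at most 1 at `lo` and at least 1 at `hi`, which guarantees bisection converges to the unique normalizer.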
arXiv Detail & Related papers (2024-03-05T18:59:47Z)
- Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z)
- Federated Linear Bandits with Finite Adversarial Actions [20.1041278044797]
We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem.
To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm.
We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{dT})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model.
arXiv Detail & Related papers (2023-11-02T03:41:58Z)
- A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback [25.68113242132723]
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback.
It simultaneously achieves a near-optimal adversarial regret guarantee in the setting with fixed delays.
We also present an extension of the algorithm to the case of arbitrary delays.
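The delay-handling mechanism such algorithms build on can be illustrated, purely schematically and not as the paper's tuned algorithm, by an exponential-weights learner that records the sampling probability of each pull and applies the importance-weighted update only when that pull's loss finally arrives, in any order:

```python
import math
import random


class DelayedExp3:
    """Exponential-weights bandit that absorbs arbitrarily delayed feedback.

    Illustrative sketch only: each pull records the probability with which
    its arm was sampled, and the importance-weighted loss estimate is
    applied whenever the (possibly much later) feedback arrives.
    """

    def __init__(self, k, eta):
        self.k = k
        self.eta = eta
        self.cum = [0.0] * k  # cumulative importance-weighted loss estimates
        self.pending = {}     # pull id -> (arm, probability at pull time)
        self.next_id = 0

    def probs(self):
        m = min(self.cum)  # shift for numerical stability
        w = [math.exp(-self.eta * (c - m)) for c in self.cum]
        z = sum(w)
        return [x / z for x in w]

    def pull(self, rng):
        p = self.probs()
        arm = rng.choices(range(self.k), weights=p)[0]
        self.pending[self.next_id] = (arm, p[arm])
        self.next_id += 1
        return self.next_id - 1, arm

    def feedback(self, pull_id, loss):
        arm, prob = self.pending.pop(pull_id)
        self.cum[arm] += loss / prob  # unbiased importance-weighted estimate


rng = random.Random(0)
bandit = DelayedExp3(k=3, eta=0.1)
pid0, _ = bandit.pull(rng)
pid1, _ = bandit.pull(rng)   # second pull before any feedback
bandit.feedback(pid1, 1.0)   # feedback arrives out of order
bandit.feedback(pid0, 0.0)
```

Storing the probability at pull time keeps the loss estimate unbiased even when many rounds elapse before the feedback is applied; the total delay $D$ then enters the regret through how stale those stored probabilities can become.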
arXiv Detail & Related papers (2022-06-29T20:49:45Z)
- Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{dST})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$.
arXiv Detail & Related papers (2022-02-12T21:55:44Z)
- On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
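Inverse Gap Weighting itself is a standard randomization rule, independent of the submodular setting: each non-greedy action gets probability inversely proportional to its estimated reward gap from the greedy action, which absorbs the remaining mass. A minimal sketch of the generic rule:

```python
def inverse_gap_weighting(rewards, gamma):
    """Inverse Gap Weighting over a finite action set.

    rewards: estimated mean reward per action.
    gamma:   exploration parameter; larger gamma concentrates
             probability on the greedy action.
    """
    k = len(rewards)
    best = max(range(k), key=lambda a: rewards[a])
    p = [0.0] * k
    for a in range(k):
        if a != best:
            p[a] = 1.0 / (k + gamma * (rewards[best] - rewards[a]))
    p[best] = 1.0 - sum(p)  # remaining mass on the greedy action
    return p


p = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0)
```

The rule makes the probability of an action scale inversely with its estimated suboptimality gap, which is exactly the property the regret analyses of IGW-based reductions exploit.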
arXiv Detail & Related papers (2021-12-03T21:42:33Z)
- Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm, \texttt{sbcalg}, that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
- Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
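The fit-threshold-refit pattern behind such estimators can be sketched generically: solve a Lasso (here with a plain ISTA loop), keep only coefficients above a threshold as the estimated support, and refit ordinary least squares on that support. The parameters below are illustrative, not the paper's tuning:

```python
import numpy as np


def lasso_ista(X, y, lam, iters=2000):
    """Minimize 0.5 * ||X theta - y||^2 + lam * ||theta||_1 by ISTA."""
    _, d = X.shape
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return theta


def thresholded_lasso(X, y, lam, tau):
    """Estimate the sparse support by thresholding a Lasso fit,
    then refit least squares on the estimated support."""
    theta = lasso_ista(X, y, lam)
    support = np.flatnonzero(np.abs(theta) >= tau)
    refit = np.zeros(X.shape[1])
    if support.size:
        refit[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return support, refit


# Synthetic check: a 10-dimensional model with two active coordinates.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_theta = np.zeros(10)
true_theta[0], true_theta[3] = 2.0, -1.5
y = X @ true_theta + 0.05 * rng.standard_normal(200)
support, refit = thresholded_lasso(X, y, lam=2.0, tau=0.5)
```

The refit step removes the Lasso's shrinkage bias on the recovered coordinates, which is what lets the regret depend on $d$ only logarithmically.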
arXiv Detail & Related papers (2020-10-22T19:14:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.