Batched Bandits with Crowd Externalities
- URL: http://arxiv.org/abs/2109.14733v1
- Date: Wed, 29 Sep 2021 21:43:16 GMT
- Title: Batched Bandits with Crowd Externalities
- Authors: Romain Laroche, Othmane Safsafi, Raphael Feraud, Nicolas Broutin
- Abstract summary: We describe a novel setting for Batched Multi-Armed Bandits (BMAB). The timing of the policy update is not controlled by the BMAB algorithm; instead, the amount of data received during each batch, called \textit{crowd}, is influenced by the past selection of arms.
- Score: 11.876737078654976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Batched Multi-Armed Bandits (BMAB), the policy is not allowed to be
updated at each time step. Usually, the setting asserts a maximum number of
allowed policy updates and the algorithm schedules them so as to minimize the
expected regret. In this paper, we describe a novel setting for BMAB, with the
following twist: the timing of the policy update is not controlled by the BMAB
algorithm, but instead the amount of data received during each batch, called
\textit{crowd}, is influenced by the past selection of arms. We first design a
near-optimal policy with approximate knowledge of the parameters that we prove
to have a regret in $\mathcal{O}(\sqrt{\frac{\ln x}{x}}+\epsilon)$ where $x$ is
the size of the crowd and $\epsilon$ is the parameter error. Next, we implement
a UCB-inspired algorithm that guarantees an additional regret in
$\mathcal{O}\left(\max(K\ln T,\sqrt{T\ln T})\right)$, where $K$ is the number
of arms and $T$ is the horizon.
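To make the crowd dynamic concrete, here is a minimal, hypothetical sketch (not the paper's algorithm): a UCB-style policy that is frozen within each batch, where the size of the next batch grows with the mean of the arm just committed to. The arm means, `base_crowd`, and the linear crowd model are illustrative assumptions.

```python
import math
import random

def batched_ucb(means, horizon, base_crowd=10, seed=0):
    """Illustrative batched UCB with a crowd externality: the policy is
    frozen within each batch, and the next batch's size ("crowd") depends
    on the arm selected, so past choices shape how much data arrives."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k       # pulls per arm
    sums = [0.0] * k       # cumulative reward per arm
    t = 0
    regret = 0.0
    best = max(means)
    while t < horizon:
        # Freeze the policy for the whole batch: one arm, chosen by UCB.
        if 0 in counts:
            arm = counts.index(0)  # play each arm at least once
        else:
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t + 1) / counts[a]))
        # Crowd externality (assumed model): better arms attract
        # larger crowds, so the next batch delivers more samples.
        batch = max(1, int(base_crowd * (0.5 + means[arm])))
        batch = min(batch, horizon - t)
        for _ in range(batch):
            reward = 1.0 if rng.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            regret += best - means[arm]
            t += 1
    return regret

# Example: higher-mean arms yield larger crowds between policy updates.
# regret = batched_ucb([0.2, 0.5, 0.8], horizon=5000)
```

The key point the sketch illustrates is that the batch boundaries are not chosen by the learner: committing to an arm also determines how much data the next batch brings, which is the externality studied in the paper.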
Related papers
- Generalized Linear Bandits with Limited Adaptivity [12.112051468737596]
We study the generalized linear contextual bandit problem within the constraints of limited adaptivity.
We present two algorithms, $\texttt{B-GLinCB}$ and $\texttt{RS-GLinCB}$, that address, respectively, two prevalent limited adaptivity settings.
arXiv Detail & Related papers (2024-04-10T08:47:57Z) - MNL-Bandit in non-stationary environments [2.7737587389928793]
We present an algorithm with a worst-case expected regret of $\tilde{O}\left(\min\left\{\sqrt{NTL}\,,\; N^{\frac{1}{3}}(\Delta_{\infty}K)^{\frac{1}{3}} T^{\frac{2}{3}} + \sqrt{NT}\right\}\right)$.
In particular, we give a tight characterization of the bias introduced in the estimators due to non-stationarity and derive new concentration bounds.
arXiv Detail & Related papers (2023-03-04T21:10:42Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}.
We provide an algorithm that w.h.p. achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies [88.75843804630772]
We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
arXiv Detail & Related papers (2022-03-24T08:14:12Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{dST})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays [21.94728545221709]
We consider the Scale-Free Adversarial Multi Armed Bandit (MAB) problem with unrestricted feedback delays.
\texttt{SFBanker} achieves $\mathcal{O}(\sqrt{K(D+T)}L)\cdot \mathrm{polylog}(T, L)$ total regret, where $T$ is the total number of steps and $D$ is the total feedback delay.
arXiv Detail & Related papers (2021-10-26T04:06:51Z) - Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z) - Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - (Locally) Differentially Private Combinatorial Semi-Bandits [26.462686161971803]
We study Combinatorial Semi-Bandits (CSB) that is an extension of classic Multi-Armed Bandits (MAB) under Differential Privacy (DP) and stronger Local Differential Privacy (LDP) setting.
arXiv Detail & Related papers (2020-06-01T04:23:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.