Multi-Armed Bandits with Interference
- URL: http://arxiv.org/abs/2402.01845v2
- Date: Mon, 15 Jul 2024 19:17:42 GMT
- Title: Multi-Armed Bandits with Interference
- Authors: Su Jia, Peter Frazier, Nathan Kallus
- Abstract summary: We show that switchback policies achieve an optimal {\em expected} regret $\tilde O(\sqrt T)$ against the best fixed-arm policy.
We propose a cluster randomization policy whose regret is optimal in {\em expectation}.
- Score: 39.578572123342944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. The cumulative performance, while equally crucial, is less well understood. To address this gap, we introduce the problem of {\em Multi-armed Bandits with Interference} (MABI), where the learner assigns an arm to each of $N$ experimental units over a time horizon of $T$ rounds. The reward of each unit in each round depends on the treatments of {\em all} units, where the influence of a unit decays in the spatial distance between units. Furthermore, we employ a general setup wherein the reward functions are chosen by an adversary and may vary arbitrarily across rounds and units. We first show that switchback policies achieve an optimal {\em expected} regret $\tilde O(\sqrt T)$ against the best fixed-arm policy. Nonetheless, the regret (as a random variable) for any switchback policy suffers a high variance, as it does not account for $N$. We propose a cluster randomization policy whose regret (i) is optimal in {\em expectation} and (ii) admits a high probability bound that vanishes in $N$.
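To make the two designs concrete, here is a minimal Python sketch of the two randomization schemes named in the abstract. It is an illustration under assumptions, not the paper's construction: the function names, the block length, the uniform arm draws, and the pre-specified spatial clustering are all choices made here for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def switchback_policy(num_arms: int, horizon: int, block_len: int) -> np.ndarray:
    """Assign one arm to ALL units for each block of consecutive rounds.

    Switching the whole system together is what makes the design robust to
    cross-unit interference; here the arm is drawn uniformly per block.
    """
    assignments = np.empty(horizon, dtype=int)
    for start in range(0, horizon, block_len):
        assignments[start:start + block_len] = rng.integers(num_arms)
    return assignments  # assignments[t] = arm given to every unit in round t

def cluster_randomization_policy(num_arms: int, clusters: np.ndarray) -> np.ndarray:
    """Assign one arm per spatial cluster for a single round.

    Since a unit's influence decays with distance, most interference stays
    within a cluster; randomizing at the cluster level averages noise over
    many clusters, so the realized regret concentrates as N grows.
    """
    cluster_arms = rng.integers(num_arms, size=clusters.max() + 1)
    return cluster_arms[clusters]  # per-unit arm assignment for this round

# Toy usage: N = 12 units on a line, grouped into 4 spatial clusters.
clusters = np.arange(12) // 3
print(switchback_policy(num_arms=3, horizon=10, block_len=2))
print(cluster_randomization_policy(num_arms=3, clusters=clusters))
```

The contrast worth noting is the unit of randomization: a switchback randomizes over time (all units switch together), while cluster randomization randomizes over space, which is what allows the high-probability regret bound to vanish in $N$.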
Related papers
- Multi-Armed Bandits with Network Interference [5.708360037632367]
We study a multi-armed bandit (MAB) problem where a learner sequentially assigns one of $\mathcal{A}$ possible actions (discounts) to $N$ units (goods) over $T$ rounds to minimize regret.
Unlike traditional MAB problems, the reward of each unit depends on the treatments assigned to other units, i.e., there is interference across the underlying network of units.
We propose simple, linear regression-based algorithms to minimize regret.
arXiv Detail & Related papers (2024-05-28T22:01:50Z)
- Short-lived High-volume Multi-A(rmed)/B(andits) Testing [6.707544598445059]
We study a Bayesian multiple-play bandit problem with a high volume of short-lived arms.
We show that when $k = O(n^\rho)$ for some constant $\rho>0$, our proposed policy has $\tilde O(n^{-\min\{\rho,\, \frac{1}{2}(1+\frac{1}{w})^{-1}\}})$ loss on a sufficiently large class of prior distributions.
arXiv Detail & Related papers (2023-12-23T21:38:35Z)
- Allocating Divisible Resources on Arms with Unknown and Random Rewards [25.93048671326331]
We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms.
The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource.
arXiv Detail & Related papers (2023-06-28T21:59:11Z)
- Communication-Constrained Bandits under Additive Gaussian Noise [111.06688156723018]
We study a distributed multi-armed bandit where a client supplies the learner with communication-constrained feedback.
We propose a multi-phase bandit algorithm, $\mathtt{UE\text{-}UCB++}$, which matches the corresponding lower bound up to a minor additive factor.
arXiv Detail & Related papers (2023-04-25T09:31:20Z)
- Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits [81.60136088841948]
We propose an algorithm that minimizes the regret over the time horizon $T$.
The proposed algorithm is applicable to domains such as recommendation systems and transportation.
arXiv Detail & Related papers (2023-01-31T03:49:00Z)
- Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations [102.32996053572144]
We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent evaluations of the true reward of each arm.
We derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated.
arXiv Detail & Related papers (2021-12-13T09:48:54Z)
- Linear Contextual Bandits with Adversarial Corruptions [91.38793800392108]
We study the linear contextual bandit problem in the presence of adversarial corruption.
We present a variance-aware algorithm that is adaptive to the level of adversarial contamination $C$.
arXiv Detail & Related papers (2021-10-25T02:53:24Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
- Learning and Optimization with Seasonal Patterns [3.7578900993400626]
We consider a nonstationary MAB model with $K$ arms whose mean rewards vary over time in a periodic manner.
We propose a two-stage policy that combines a confidence-bound analysis with a learning procedure to learn the unknown periods.
We show that our learning policy incurs a regret upper bound $\tilde{O}(\sqrt{T\sum_{k=1}^{K} T_k})$, where $T_k$ is the period of arm $k$.
arXiv Detail & Related papers (2020-05-16T20:19:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.