Short-lived High-volume Multi-A(rmed)/B(andits) Testing
- URL: http://arxiv.org/abs/2312.15356v1
- Date: Sat, 23 Dec 2023 21:38:35 GMT
- Title: Short-lived High-volume Multi-A(rmed)/B(andits) Testing
- Authors: Su Jia, Andrew Li, R. Ravi, Nishant Oli, Paul Duff, Ian Anderson
- Abstract summary: We study a Bayesian multiple-play bandit problem with a high volume of short-lived arms.
We show that when $k = O(n^\rho)$ for some constant $\rho>0$, our proposed policy has $\tilde O(n^{-\min\{\rho, \frac{1}{2}(1+\frac{1}{w})^{-1}\}})$ loss on a sufficiently large class of prior distributions.
- Score: 6.707544598445059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern platforms leverage randomized experiments to make informed decisions
from a given set of items ("treatments"). As a particularly challenging
scenario, these items may (i) arrive in high volume, with thousands of new
items being released per hour, and (ii) have short lifetime, say, due to the
item's transient nature or underlying non-stationarity that impels the platform
to perceive the same item as distinct copies over time. Thus motivated, we
study a Bayesian multiple-play bandit problem that encapsulates the key
features of the multivariate testing (or "multi-A/B testing") problem with a
high volume of short-lived arms. In each round, a set of $k$ arms arrive, each
available for $w$ rounds. Without knowing the mean reward for each arm, the
learner selects a multiset of $n$ arms and immediately observes their realized
rewards. We aim to minimize the loss due to not knowing the mean rewards,
averaged over instances generated from a given prior distribution. We show that
when $k = O(n^\rho)$ for some constant $\rho>0$, our proposed policy has
$\tilde O(n^{-\min \{\rho, \frac 12 (1+\frac 1w)^{-1}\}})$ loss on a
sufficiently large class of prior distributions. We complement this result by
showing that every policy suffers $\Omega (n^{-\min \{\rho, \frac 12\}})$ loss
on the same class of distributions. We further validate the effectiveness of
our policy through a large-scale field experiment on Glance, a
content-card-serving platform that faces exactly the above challenge. A simple
variant of our policy outperforms the platform's current recommender by 4.32%
in total duration and 7.48% in total number of click-throughs.
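
To make the setting concrete, here is a minimal Python sketch of the environment the abstract describes: in each round a cohort of $k$ fresh arms arrives, each cohort lives for $w$ rounds, and the learner allocates a multiset of $n$ pulls over all live arms, observing realized rewards immediately. The allocation rule below is a naive explore-then-exploit baseline, not the paper's proposed policy; the Bernoulli rewards, the Beta(1, 3) prior, the `explore_frac` split, and all parameter values are assumptions made for illustration.

```python
# Minimal sketch of the short-lived high-volume bandit setting (illustrative
# only; NOT the paper's policy). Cohorts of k fresh Bernoulli arms arrive each
# round and expire after w rounds; each round the learner spends n pulls.
import numpy as np

rng = np.random.default_rng(0)

def run_round(means, counts, sums, n, explore_frac=0.5):
    """Allocate n pulls over the live arms and observe realized rewards.

    Spends an explore_frac share of pulls uniformly over all live arms,
    and the remainder on the arm with the best empirical mean so far.
    """
    k = len(means)
    n_explore = int(explore_frac * n)
    alloc = np.full(k, n_explore // k)           # uniform exploration...
    alloc[: n_explore % k] += 1                  # ...plus the remainder
    emp = np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)
    alloc[np.argmax(emp)] += n - n_explore       # exploit the empirical best arm
    rewards = rng.binomial(alloc, means)         # rewards observed immediately
    counts += alloc                              # in-place statistic updates
    sums += rewards
    return rewards.sum()

def simulate(T=50, k=100, w=4, n=1000):
    """k new arms arrive per round; each lives exactly w rounds."""
    live, total = [], 0.0                        # cohorts: [means, counts, sums, ttl]
    for _ in range(T):
        live.append([rng.beta(1, 3, size=k), np.zeros(k), np.zeros(k), w])
        means = np.concatenate([c[0] for c in live])
        counts = np.concatenate([c[1] for c in live])
        sums = np.concatenate([c[2] for c in live])
        total += run_round(means, counts, sums, n)
        i = 0                                    # write statistics back, age cohorts
        for c in live:
            m = len(c[0])
            c[1], c[2] = counts[i:i + m], sums[i:i + m]
            c[3] -= 1
            i += m
        live = [c for c in live if c[3] > 0]     # expired arms leave the pool
    return total

print(simulate())
```

The abstract's bounds indicate why this regime is hard: with $k = O(n^\rho)$ fresh arms per round and only $w$ rounds to learn each arm, no policy can beat $\Omega(n^{-\min\{\rho, \frac{1}{2}\}})$ loss, and the paper's proposed policy nearly matches that barrier.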
Related papers
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z) - Multi-Armed Bandits with Interference [39.578572123342944]
We show that switchback policies achieve an optimal expected regret $\tilde O(\sqrt{T})$ against the best fixed-arm policy.
We propose a cluster randomization policy whose regret is optimal in expectation.
arXiv Detail & Related papers (2024-02-02T19:02:47Z) - Allocating Divisible Resources on Arms with Unknown and Random Rewards [25.93048671326331]
We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms.
The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource.
arXiv Detail & Related papers (2023-06-28T21:59:11Z) - Piecewise-Stationary Multi-Objective Multi-Armed Bandit with Application to Joint Communications and Sensing [7.0997346625024]
We propose a generic upper confidence bound (UCB)-based algorithm with change detection to solve this problem.
We also formulate an energy-efficient waveform design problem in an integrated communication and sensing system as a toy example.
arXiv Detail & Related papers (2023-02-10T14:10:14Z) - Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits [81.60136088841948]
We propose an algorithm that minimizes the regret over the horizon of time $T$.
The proposed algorithm is applicable to domains such as recommendation systems and transportation.
arXiv Detail & Related papers (2023-01-31T03:49:00Z) - Smooth Non-Stationary Bandits [49.19728527803684]
We study a non-stationary bandit problem where each arm's mean reward sequence can be embedded into a $\beta$-Hölder function.
We show the first separation between the smooth (i.e., $\beta \ge 2$) and non-smooth (i.e., $\beta = 1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-Hölder instance.
arXiv Detail & Related papers (2023-01-29T06:03:20Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs [7.125769932993104]
We consider the contextual multi-armed bandit problem for linear payoffs under a risk-averse criterion.
At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward.
We apply the Thompson Sampling algorithm for the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm (a generic sketch of disjoint-model Thompson Sampling appears after this list).
arXiv Detail & Related papers (2022-06-24T18:48:35Z) - Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations [102.32996053572144]
We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent evaluations of the true reward of each arm.
We derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated.
arXiv Detail & Related papers (2021-12-13T09:48:54Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log (|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z)
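
For the Thompson Sampling entry above (Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs), here is a generic sketch of Thompson Sampling for the disjoint linear model, i.e., one parameter vector per arm with Gaussian posteriors. This is a textbook-style baseline under assumed priors, not the paper's risk-averse variant; the class name `DisjointLinTS`, the `v` scale hyperparameter, and the identity (ridge-style) prior are all illustrative.

```python
# Generic Thompson Sampling for a disjoint linear contextual bandit
# (one parameter vector per arm). Illustrative sketch only; the paper's
# risk-averse selection criterion is NOT implemented here.
import numpy as np

rng = np.random.default_rng(1)

class DisjointLinTS:
    def __init__(self, n_arms, dim, v=1.0):
        self.v = v                                        # posterior scale (assumed)
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm precision matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # per-arm reward-weighted contexts

    def choose(self, contexts):
        """Sample each arm's parameters from its Gaussian posterior and
        pull the arm whose sampled linear payoff is largest."""
        scores = []
        for a, x in enumerate(contexts):
            mu = np.linalg.solve(self.A[a], self.b[a])    # posterior mean
            cov = self.v ** 2 * np.linalg.inv(self.A[a])  # posterior covariance
            theta = rng.multivariate_normal(mu, cov)
            scores.append(x @ theta)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Rank-one Bayesian update for the pulled arm only."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

A driver would call choose with one context vector per arm each round, then update with the realized reward; a risk-averse variant, as studied in that paper, would modify the selection rule, e.g., by penalizing the variance of the sampled payoff rather than taking a plain argmax.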
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.