Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax
Regret
- URL: http://arxiv.org/abs/2101.08980v1
- Date: Fri, 22 Jan 2021 07:34:09 GMT
- Title: Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax
Regret
- Authors: Lai Wei and Vaibhav Srivastava
- Abstract summary: We study the nonstationary Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with each arm are assumed to be time-varying.
We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget.
- Score: 5.1398743023989555
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in
which the distributions of rewards associated with each arm are assumed to be
time-varying and the total variation in the expected rewards is subject to a
variation budget. The regret of a policy is defined by the difference in the
expected cumulative rewards obtained using the policy and using an oracle that
selects the arm with the maximum mean reward at each time. We characterize the
performance of the proposed policies in terms of the worst-case regret, which
is the supremum of the regret over the set of reward distribution sequences
satisfying the variation budget. We extend Upper-Confidence Bound (UCB)-based
policies with three different approaches, namely, periodic resetting, sliding
observation window and discount factor and show that they are order-optimal
with respect to the minimax regret, i.e., the minimum worst-case regret
achieved by any policy. We also relax the sub-Gaussian assumption on reward
distributions and develop robust versions of the proposed policies that can handle
heavy-tailed reward distributions and maintain their performance guarantees.
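To make the objective concrete: writing $\mu_{i,t}$ for the mean reward of arm $i$ at time $t$ and $A_t$ for the arm chosen by a policy $\pi$, the regret described above is $R^\pi(T) = \mathbb{E}[\sum_{t=1}^{T} (\max_i \mu_{i,t} - \mu_{A_t,t})]$, and the worst-case regret is the supremum of $R^\pi(T)$ over all mean-reward sequences whose total variation $\sum_{t=1}^{T-1} \max_i |\mu_{i,t+1} - \mu_{i,t}|$ stays within the variation budget $V_T$. This is a standard formalization consistent with the abstract, not a verbatim excerpt from the paper.

As a rough illustration of one of the three forgetting mechanisms mentioned (the sliding observation window), the Python sketch below implements a generic sliding-window UCB rule. The window length, exploration constant, and bounded-reward assumption are illustrative choices, not the tuned parameters from the paper.

```python
import math
import random
from collections import deque

def sliding_window_ucb(reward_fns, T, window=200, c=2.0):
    """Generic sliding-window UCB sketch (illustrative; not the paper's exact policy).

    reward_fns: one function per arm, f(t) -> reward in [0, 1], possibly time-varying.
    window: number of most recent rounds used to compute each arm's statistics.
    """
    K = len(reward_fns)
    history = deque()              # (arm, reward) pairs for the last `window` rounds
    total_reward = 0.0
    for t in range(1, T + 1):
        # Recompute counts and sums restricted to the sliding window.
        counts, sums = [0] * K, [0.0] * K
        for arm, r in history:
            counts[arm] += 1
            sums[arm] += r
        untried = [i for i in range(K) if counts[i] == 0]
        if untried:
            a = untried[0]         # play any arm with no observations in the window
        else:
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(c * math.log(min(t, window)) / counts[i]))
        r = reward_fns[a](t)
        total_reward += r
        history.append((a, r))
        if len(history) > window:
            history.popleft()      # forget observations older than the window
    return total_reward

# Example: two Bernoulli arms whose means swap halfway through the horizon.
if __name__ == "__main__":
    arms = [lambda t: float(random.random() < (0.8 if t <= 5000 else 0.2)),
            lambda t: float(random.random() < (0.2 if t <= 5000 else 0.8))]
    print(sliding_window_ucb(arms, T=10000))
```

Forgetting old observations is what lets the index track drifting means; periodic resetting and discounting achieve the same effect by restarting the statistics or by geometrically downweighting old samples.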
Related papers
- Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback.
This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc.
Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z)
- Catoni Contextual Bandits are Robust to Heavy-tailed Rewards [31.381627608971414]
We develop an algorithmic approach building on Catoni's estimator from robust statistics.
We establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$.
Our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence.
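For orientation, Catoni's estimator is a classical robust mean estimator defined implicitly through a soft influence function; the sketch below is a generic scalar version computed by bisection, not the contextual-bandit algorithm of this paper, and the default scale parameter alpha is an arbitrary illustrative choice (in the theory it is tuned from the variance and the confidence level).

```python
import math

def _psi(x):
    """Catoni's influence function: log(1 + x + x^2/2) for x >= 0, odd-extended for x < 0."""
    return math.log(1.0 + x + 0.5 * x * x) if x >= 0 else -math.log(1.0 - x + 0.5 * x * x)

def catoni_mean(samples, alpha=0.1, iters=200):
    """Scalar Catoni-type mean estimate: the root theta of sum_i psi(alpha * (x_i - theta)) = 0."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return lo
    # The sum is decreasing in theta, non-negative at lo and non-positive at hi,
    # so plain bisection converges to the root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(_psi(alpha * (x - mid)) for x in samples) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because the influence function grows only logarithmically, a single heavy-tailed outlier shifts this estimate far less than it shifts the plain empirical mean, which is the robustness property such regret analyses build on.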
arXiv Detail & Related papers (2025-02-04T17:03:32Z)
- Constrained Best Arm Identification in Grouped Bandits [3.387374559368306]
We study a grouped bandit setting where each arm comprises multiple independent sub-arms referred to as attributes.
We impose the constraint that for an arm to be deemed feasible, the mean reward of all its attributes should exceed a specified threshold.
The goal is to find the arm with the highest mean reward averaged across attributes among the set of feasible arms in the fixed confidence setting.
arXiv Detail & Related papers (2024-12-11T02:19:19Z)
- Stochastic Bandits for Egalitarian Assignment [58.33714486693828]
We study EgalMAB, an egalitarian assignment problem in the context of multi-armed bandits.
We design and analyze a UCB-based policy EgalUCB and establish upper bounds on the cumulative regret.
arXiv Detail & Related papers (2024-10-08T09:49:47Z)
- Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals [0.9831489366502302]
We study the Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from $K$ arms with unknown expected rewards and costs.
We propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals.
These intervals scale with the distance between the sample mean and the bounds of a random variable, yielding a more accurate and tight estimation of the reward-cost ratio.
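For intuition only: the Wilson score interval is a familiar example of an asymmetric confidence interval for a bounded (here Bernoulli) variable; its endpoints stay inside the variable's bounds and it tightens as the sample mean approaches them. It is not the $\omega$-UCB interval defined in the paper; the sketch below just illustrates the general idea.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a Bernoulli mean: asymmetric around the sample mean,
    always contained in [0, 1], and tighter when the mean is close to either bound.
    (Shown only as a familiar example of asymmetric intervals; omega-UCB uses its own.)"""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return center - half, center + half

# Example: 9 successes out of 10 trials -> interval pulled in from 1.0, asymmetric around 0.9.
# print(wilson_interval(9, 10))
```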
arXiv Detail & Related papers (2023-06-12T12:35:16Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits [0.0]
We consider the upper confidence bound strategy for multi-armed bandits with known control horizon sizes $N$.
A set of Monte-Carlo simulations was performed for the case of close distributions of rewards, when mean rewards differ by a quantity of order $N^{-1/2}$.
The minimal size of the control horizon for which the normalized regret is not noticeably larger than its maximum possible value was estimated.
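As a rough sketch of the kind of Monte-Carlo experiment described (the exact setup, UCB index, and normalization in the paper may differ), the code below runs a basic UCB rule on two Gaussian arms whose means differ by a quantity of order $N^{-1/2}$ and reports the regret scaled by $\sqrt{N}$; the index form, noise variance, and number of replications are illustrative assumptions.

```python
import math
import random

def ucb_close_gaussian_regret(N, gap_scale=1.0, sigma=1.0, runs=200):
    """Average scaled pseudo-regret of a basic UCB rule on two close Gaussian arms."""
    gap = gap_scale / math.sqrt(N)            # mean rewards differ by order N^{-1/2}
    means = [gap, 0.0]
    total = 0.0
    for _ in range(runs):
        counts, sums, regret = [0, 0], [0.0, 0.0], 0.0
        for t in range(1, N + 1):
            if 0 in counts:                    # play each arm once first
                a = counts.index(0)
            else:
                a = max((0, 1), key=lambda i: sums[i] / counts[i]
                        + sigma * math.sqrt(2.0 * math.log(t) / counts[i]))
            reward = random.gauss(means[a], sigma)
            counts[a] += 1
            sums[a] += reward
            regret += means[0] - means[a]      # pseudo-regret against the better arm
        total += regret
    return total / runs / (sigma * math.sqrt(N))   # one common scaling for close arms

# print(ucb_close_gaussian_regret(10000))
```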
arXiv Detail & Related papers (2021-12-13T05:24:59Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Minimax Policy for Heavy-tailed Bandits [10.203602318836444]
We modify the minimax policy MOSS for the sub-Gaussian reward distribution by using a saturated empirical mean to design a new algorithm called Robust MOSS.
We show that if the moment of order $1+\epsilon$ for the reward distribution exists, then the refined strategy has a worst-case regret matching the lower bound while maintaining a distribution-dependent regret.
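The "saturated empirical mean" referred to above is, informally, an empirical mean computed from rewards clipped at a threshold so that heavy-tailed samples have bounded influence; the sketch below is a generic illustration of that idea, with the threshold schedule stated as an assumption rather than taken from the Robust MOSS analysis.

```python
def saturated_mean(rewards, threshold):
    """Empirical mean of rewards clipped (saturated) to the range [-threshold, threshold].

    In analyses of this kind the threshold typically grows with the number of samples at a
    rate tied to the existing moment of order 1 + epsilon; the schedule in the comment at
    the end is purely illustrative.
    """
    clipped = [max(-threshold, min(threshold, r)) for r in rewards]
    return sum(clipped) / len(clipped)

# Illustrative use inside a UCB/MOSS-style index (epsilon = 0.5 is an arbitrary choice):
# index_i = saturated_mean(rewards_i, threshold=len(rewards_i) ** (1.0 / 1.5)) + exploration_bonus_i
```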
arXiv Detail & Related papers (2020-07-20T21:43:57Z)
- Adaptive Discretization for Adversarial Lipschitz Bandits [85.39106976861702]
Lipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces.
A central theme here is the adaptive discretization of the action space, which gradually "zooms in" on the more promising regions.
We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds.
arXiv Detail & Related papers (2020-06-22T16:06:25Z)
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.