Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret
- URL: http://arxiv.org/abs/2101.08980v1
- Date: Fri, 22 Jan 2021 07:34:09 GMT
- Title: Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret
- Authors: Lai Wei and Vaibhav Srivastava
- Abstract summary: We study the nonstationary Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with each arm are assumed to be time-varying.
We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget.
- Score: 5.1398743023989555
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in
which the distributions of rewards associated with each arm are assumed to be
time-varying and the total variation in the expected rewards is subject to a
variation budget. The regret of a policy is defined by the difference in the
expected cumulative rewards obtained using the policy and using an oracle that
selects the arm with the maximum mean reward at each time. We characterize the
performance of the proposed policies in terms of the worst-case regret, which
is the supremum of the regret over the set of reward distribution sequences
satisfying the variation budget. We extend Upper-Confidence Bound (UCB)-based
policies with three different approaches, namely, periodic resetting, sliding
observation window and discount factor and show that they are order-optimal
with respect to the minimax regret, i.e., the minimum worst-case regret
achieved by any policy. We also relax the sub-Gaussian assumption on reward
distributions and develop robust versions of the proposed policies that can handle
heavy-tailed reward distributions and maintain their performance guarantees.
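As a loose illustration of how a limited-memory UCB index handles drifting rewards, here is a minimal sliding-window UCB sketch in Python; the window length, exploration constant, and index form are generic assumptions for illustration, not the order-optimal tuning analyzed in the paper.

    import math
    import random
    from collections import deque

    def sliding_window_ucb(reward_fns, horizon, window=200, c=2.0, seed=0):
        """Minimal sliding-window UCB sketch for nonstationary bandits.

        Only the last `window` plays are used to form each arm's sample mean
        and confidence radius, so observations from an old, drifted reward
        distribution are forgotten. The window length and the constant `c`
        are illustrative choices, not the tuned values from the paper.
        """
        rng = random.Random(seed)
        K = len(reward_fns)
        history = deque()                    # (arm, reward) pairs inside the window
        counts = [0] * K                     # windowed pull counts
        sums = [0.0] * K                     # windowed reward sums
        total = 0.0
        for t in range(1, horizon + 1):
            if any(n == 0 for n in counts):  # play any arm unseen within the window
                arm = counts.index(0)
            else:
                arm = max(
                    range(K),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(c * math.log(min(t, window)) / counts[i]),
                )
            r = reward_fns[arm](t, rng)
            total += r
            history.append((arm, r))
            counts[arm] += 1
            sums[arm] += r
            if len(history) > window:        # drop the oldest observation
                old_arm, old_r = history.popleft()
                counts[old_arm] -= 1
                sums[old_arm] -= old_r
        return total

    # Example: two Bernoulli arms whose means swap halfway through the horizon.
    arms = [
        lambda t, rng: float(rng.random() < (0.8 if t < 5000 else 0.2)),
        lambda t, rng: float(rng.random() < (0.2 if t < 5000 else 0.8)),
    ]
    print(sliding_window_ucb(arms, horizon=10000))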
Related papers
- Stochastic Bandits for Egalitarian Assignment [58.33714486693828]
We study EgalMAB, an egalitarian assignment problem in the context of multi-armed bandits.
We design and analyze a UCB-based policy EgalUCB and establish upper bounds on the cumulative regret.
arXiv Detail & Related papers (2024-10-08T09:49:47Z)
- An Adaptive Approach for Infinitely Many-armed Bandits under Generalized Rotting Constraints [29.596684377841182]
In this study, we consider the infinitely many-armed bandit problems in a rested rotting setting, where the mean reward of an arm may decrease with each pull, while otherwise, it remains unchanged.
We introduce an algorithm that utilizes UCB with an adaptive sliding window, designed to manage the bias and variance trade-off arising due to rotting rewards.
arXiv Detail & Related papers (2024-04-22T14:11:54Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the total-variation budget $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^{+}$ and Restarted $\text{SAVE}^{+}$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals [0.9831489366502302]
We study the Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from $K$ arms with unknown expected rewards and costs.
We propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals.
These intervals scale with the distance between the sample mean and the bounds of a random variable, yielding a more accurate and tight estimation of the reward-cost ratio.
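A minimal sketch of the general idea of an asymmetric confidence interval for a reward bounded in [0, 1], using a Wilson-score-style interval; this only illustrates the asymmetry, and is not the exact $\omega$-UCB interval or its reward-cost-ratio index.

    import math

    def wilson_interval(mean, n, z=2.0):
        """Wilson-score-style confidence interval for a reward in [0, 1].

        The interval is asymmetric around the sample mean: it tightens on the
        side closer to the nearest bound (0 or 1), which is the general idea
        behind asymmetric UCB indices; the paper's exact formula may differ.
        """
        z2n = z * z / n
        center = (mean + z2n / 2.0) / (1.0 + z2n)
        half = (z / (1.0 + z2n)) * math.sqrt(mean * (1.0 - mean) / n + z2n / (4.0 * n))
        return max(0.0, center - half), min(1.0, center + half)

    # Example: the interval hugs the boundary when the mean is near 1.
    print(wilson_interval(0.95, n=20))   # asymmetric: tight above, looser below
    print(wilson_interval(0.50, n=20))   # roughly symmetric in the interior

In the budgeted setting, such per-quantity intervals would typically be combined by dividing the reward's upper bound by the cost's lower bound to obtain an optimistic estimate of the reward-cost ratio.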
arXiv Detail & Related papers (2023-06-12T12:35:16Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits [27.058087400790555]
We study the multi-armed bandit problem and design new policies that enjoy both worst-case optimality for expected regret and light-tailed risk for regret distribution.
We find that from a managerial perspective, our new policy design yields better tail distributions and is preferable to celebrated policies.
arXiv Detail & Related papers (2022-06-07T02:10:30Z)
- Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits [0.0]
We consider the upper confidence bound strategy for multi-armed bandits with known control horizon size $N$.
A set of Monte-Carlo simulations was performed for the case of close reward distributions, where the mean rewards differ by a magnitude of order $N^{-1/2}$.
The minimal size of the control horizon for which the normalized regret is not noticeably larger than its maximum possible value was estimated.
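A minimal Monte-Carlo sketch of this close-means regime, with two Gaussian arms whose means differ by d / sqrt(N) and a generic UCB1-style index; the index form and constants are assumptions for illustration, not the exact UCB rule studied in the paper.

    import numpy as np

    def normalized_ucb_regret(N, d=1.0, sigma=1.0, runs=200, seed=0):
        """Monte-Carlo estimate of normalized UCB regret on two Gaussian arms
        whose means differ by d / sqrt(N) (the close-distributions regime)."""
        rng = np.random.default_rng(seed)
        gap = d / np.sqrt(N)
        means = np.array([gap, 0.0])             # arm 0 is optimal
        total_regret = 0.0
        for _ in range(runs):
            counts = np.zeros(2)
            sums = np.zeros(2)
            for t in range(1, N + 1):
                if t <= 2:                       # initialize: pull each arm once
                    arm = t - 1
                else:
                    index = sums / counts + sigma * np.sqrt(2.0 * np.log(t) / counts)
                    arm = int(np.argmax(index))
                sums[arm] += rng.normal(means[arm], sigma)
                counts[arm] += 1
            total_regret += gap * counts[1]      # pseudo-regret: pulls of the worse arm
        # In this regime regret grows like sqrt(N), so normalize by sigma * sqrt(N).
        return total_regret / runs / (sigma * np.sqrt(N))

    print(normalized_ucb_regret(N=2000))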
arXiv Detail & Related papers (2021-12-13T05:24:59Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Minimax Policy for Heavy-tailed Bandits [10.203602318836444]
We modify the minimax policy MOSS, designed for sub-Gaussian reward distributions, by using a saturated empirical mean to design a new algorithm called Robust MOSS.
We show that if the moment of order $1+\epsilon$ for the reward distribution exists, then the refined strategy has a worst-case regret matching the lower bound while maintaining a distribution-dependent regret.
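A rough sketch of the general idea behind a saturated (clipped) empirical mean for heavy-tailed rewards; the fixed clipping level below is a placeholder for illustration, not the saturation level used by Robust MOSS.

    import numpy as np

    def saturated_mean(rewards, level):
        """Empirical mean after clipping each observation to [-level, level].
        Clipping tames heavy tails at the price of a small, controllable bias;
        Robust MOSS chooses its saturation level carefully, which this generic
        sketch does not reproduce."""
        x = np.clip(np.asarray(rewards, dtype=float), -level, level)
        return x.mean()

    # Example: a heavy-tailed sample where the ordinary mean is dominated by one outlier.
    sample = np.array([0.2, -0.1, 0.4, 0.1, 50.0])
    print(sample.mean())                 # pulled far up by the outlier
    print(saturated_mean(sample, 1.0))   # stays close to the bulk of the data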
arXiv Detail & Related papers (2020-07-20T21:43:57Z)
- Adaptive Discretization for Adversarial Lipschitz Bandits [85.39106976861702]
Lipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces.
A central theme here is the adaptive discretization of the action space, which gradually "zooms in" on the more promising regions.
We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds.
arXiv Detail & Related papers (2020-06-22T16:06:25Z)
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)