Extended UCB Policies for Frequentist Multi-armed Bandit Problems
- URL: http://arxiv.org/abs/1112.1768v4
- Date: Mon, 30 Jun 2025 04:15:58 GMT
- Title: Extended UCB Policies for Frequentist Multi-armed Bandit Problems
- Authors: Keqin Liu, Tianshuo Zheng, Zhi-Hua Zhou
- Abstract summary: This paper mainly considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, an extension of the pioneering UCB policies.
- Score: 49.1574468325115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The multi-armed bandit (MAB) problem is a widely studied model in the field of operations research for sequential decision making and reinforcement learning. This paper mainly considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, which is an extension of the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [22]. The previous UCB policies require some strict conditions on the reward distributions, which can be hard to guarantee in practical scenarios. Our extended robust UCB generalizes Lattimore's seminal work (for moments of orders $p=4$ and $q=2$) to arbitrarily chosen $p>q>1$ as long as the two moments have a known controlled relationship, while still achieving the optimal regret growth order $O(\log T)$, thus broadening the application area of UCB policies for heavy-tailed reward distributions. Furthermore, we achieve a near-optimal regret order without any knowledge of the reward distributions as long as their $p$-th moments exist for some $p>1$.
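As a concrete illustration of the robust-UCB template of Bubeck et al. [5] that this paper extends, the sketch below implements the standard truncated-mean index for rewards whose $(1+\epsilon)$-th raw moment is bounded by a known constant $u$. This is a minimal sketch of the baseline template, not the extended policy of this paper; the function names, the arm callables, and the choice $\delta = t^{-2}$ are illustrative assumptions.

```python
import math

def truncated_mean(rewards, u, eps, delta):
    """Truncated empirical mean for heavy-tailed rewards, assuming the
    (1+eps)-th raw moment is bounded by a known constant u."""
    s = len(rewards)
    total = 0.0
    for t, x in enumerate(rewards, start=1):
        # Truncation level grows with the sample index so the bias vanishes.
        level = (u * t / math.log(1.0 / delta)) ** (1.0 / (1.0 + eps))
        if abs(x) <= level:
            total += x
    return total / s

def robust_ucb(arms, horizon, u, eps):
    """Minimal robust-UCB loop; each 'arm' is a callable returning a reward.

    The confidence radius matches the truncated-mean concentration bound,
    giving O(log T) regret when the (1+eps)-th moments are bounded by u.
    """
    k = len(arms)
    history = [[] for _ in range(k)]
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1                 # pull every arm once first
        else:
            delta = t ** -2           # confidence level shrinks like 1/t^2

            def index(j):
                s = len(history[j])
                mean = truncated_mean(history[j], u, eps, delta)
                radius = 4.0 * u ** (1.0 / (1.0 + eps)) * (
                    math.log(1.0 / delta) / s) ** (eps / (1.0 + eps))
                return mean + radius

            i = max(range(k), key=index)
        history[i].append(arms[i]())
    return history
```

For a quick simulation one might use, e.g., `arms = [lambda m=m: m + random.paretovariate(2.5) for m in (0.0, 0.5)]` (after `import random`), with `eps` and `u` chosen so that the $(1+\epsilon)$-th moments are indeed bounded by `u`.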
Related papers
- Quick-Draw Bandits: Quickly Optimizing in Nonstationary Environments with Extremely Many Arms [80.05851139852311]
We propose a novel policy to learn reward environments over a continuous space using Gaussian. We show that our method efficiently learns continuous Lipschitz reward functions with $\mathcal{O}^*(\sqrt{T})$ cumulative regret. Furthermore, our method naturally extends to non-stationary problems with a simple modification.
arXiv Detail & Related papers (2025-05-30T15:15:18Z) - Stochastic Bandits for Egalitarian Assignment [58.33714486693828]
We study EgalMAB, an egalitarian assignment problem in the context of multi-armed bandits.
We design and analyze a UCB-based policy EgalUCB and establish upper bounds on the cumulative regret.
arXiv Detail & Related papers (2024-10-08T09:49:47Z) - Data-Driven Upper Confidence Bounds with Near-Optimal Regret for Heavy-Tailed Bandits [0.0]
We propose a new distribution-free, data-driven UCB algorithm for symmetric reward distributions.
We prove a near-optimal regret bound for the proposed anytime, parameter-free RMM-UCB method, even for heavy-tailed distributions.
arXiv Detail & Related papers (2024-06-09T10:06:50Z) - Extreme Value Monte Carlo Tree Search [1.6574413179773757]
The UCB1 Multi-Armed Bandit (MAB) algorithm had limited success in domain-independent planning until recently.
We propose two bandits, UCB1-Uniform/Power, and apply them to Monte-Carlo Tree Search (MCTS) for classical planning.
arXiv Detail & Related papers (2024-05-28T14:58:43Z) - Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z) - Best Arm Identification for Stochastic Rising Bandits [84.55453174601826]
Stochastic Rising Bandits (SRBs) model sequential decision-making problems in which the expected reward of the available options increases every time they are selected.
This paper focuses on the fixed-budget Best Arm Identification (BAI) problem for SRBs.
We propose two algorithms to tackle the above-mentioned setting, namely R-UCBE and R-SR.
arXiv Detail & Related papers (2023-02-15T08:01:37Z) - Sample Complexity of an Adversarial Attack on UCB-based Best-arm Identification Policy [0.0]
I study the problem of adversarial perturbations to rewards, in a Multi-armed bandit (MAB) setting.
I focus on an adversarial attack to a UCB type best-arm identification policy applied to a MAB.
arXiv Detail & Related papers (2022-09-13T02:31:30Z) - Best Arm Identification under Additive Transfer Bandits [49.69203462561861]
We study the setting when, despite the means being unknown, there is a known additive relationship between the source and target MAB instances.
We propose and theoretically analyze an LUCB-style algorithm to identify an $\epsilon$-optimal target arm with high probability.
arXiv Detail & Related papers (2021-12-08T02:20:18Z) - Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits [91.8283876874947]
We design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm.
We show that (i) no algorithm can simultaneously perform optimally for both the RM and BAI objectives.
We also show that (ii) BoBW-lil'UCB$(\gamma)$ outperforms a competitor in terms of both time complexity and regret.
arXiv Detail & Related papers (2021-10-16T17:52:32Z) - Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We study episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
The proposed method learns an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the numbers of states and actions, respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z) - Addressing the Long-term Impact of ML Decisions via Policy Regret [49.92903850297013]
We study a setting in which the reward from each arm evolves every time the decision-maker pulls that arm.
We argue that an acceptable sequential allocation of opportunities must take an arm's potential for growth into account.
We present an algorithm with provably sub-linear policy regret for sufficiently long time horizons.
arXiv Detail & Related papers (2021-06-02T17:38:10Z) - Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z) - Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret [5.1398743023989555]
We study the nonstationary Multi-Armed Bandit (MAB) problem, in which the reward distribution associated with each arm is assumed to be time-varying.
We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget.
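For reference, the variation-budget formulation behind this worst-case notion can be written as follows (a standard formalization from the nonstationary-bandit literature; the symbols $V_T$, $\mu_i(t)$, and the policy $\pi$ are notation introduced here, not taken from the snippet):

```latex
% Admissible mean-reward sequences under a variation budget V_T:
\mathcal{E}(V_T) = \Bigl\{ (\mu_i(t))_{i,t} :
  \sum_{t=1}^{T-1} \max_i \bigl|\mu_i(t+1)-\mu_i(t)\bigr| \le V_T \Bigr\}
% Worst-case (minimax) regret of a policy \pi over that set:
R_T(\pi) = \sup_{\mathcal{E}(V_T)}
  \mathbb{E}\Bigl[\, \sum_{t=1}^{T} \bigl( \max_i \mu_i(t) - \mu_{\pi(t)}(t) \bigr) \Bigr]
```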
arXiv Detail & Related papers (2021-01-22T07:34:09Z) - Learning and Optimization with Seasonal Patterns [3.7578900993400626]
We consider a nonstationary MAB model with $K$ arms whose mean rewards vary over time in a periodic manner.
We propose a two-stage policy that combines a confidence-bound analysis with a learning procedure to learn the unknown periods.
We show that our learning policy incurs a regret upper bound $\tilde{O}\bigl(\sqrt{T\sum_{k=1}^{K} T_k}\bigr)$, where $T_k$ is the period of arm $k$.
arXiv Detail & Related papers (2020-05-16T20:19:25Z) - Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition [59.34067736545355]
We study the reinforcement learning problem in finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$.
We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2 S A T})$ regret, where $T = KH$ and $K$ is the number of episodes to play.
arXiv Detail & Related papers (2020-04-21T14:00:06Z) - The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms [10.662105162882526]
We investigate a $k$-armed bandit problem in the many-armed regime.
Our findings suggest a new form of free exploration beneficial to greedy algorithms in the many-armed context.
arXiv Detail & Related papers (2020-02-24T08:59:34Z)