Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals
- URL: http://arxiv.org/abs/2306.07071v2
- Date: Tue, 15 Aug 2023 10:55:53 GMT
- Title: Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals
- Authors: Marco Heyden, Vadim Arzamasov, Édouard Fouché, Klemens Böhm
- Abstract summary: We study the Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from $K$ arms with unknown expected rewards and costs.
We propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals.
These intervals scale with the distance between the sample mean and the bounds of a random variable, yielding a tighter and more accurate estimate of the reward-cost ratio.
- Score: 0.9831489366502302
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We study the stochastic Budgeted Multi-Armed Bandit (MAB) problem, where a
player chooses from $K$ arms with unknown expected rewards and costs. The goal
is to maximize the total reward under a budget constraint. A player thus seeks
to choose the arm with the highest reward-cost ratio as often as possible.
Current state-of-the-art policies for this problem have several issues, which
we illustrate. To overcome them, we propose a new upper confidence bound (UCB)
sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals. These
intervals scale with the distance between the sample mean and the bounds of a
random variable, yielding a tighter and more accurate estimate of the
reward-cost ratio than competing policies. We show that our approach has
logarithmic regret and consistently outperforms existing policies in synthetic
and real settings.
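To make the idea concrete, here is a minimal Python sketch, assuming rewards and costs bounded in $[0, 1]$. It uses a Wilson-score-style interval as the asymmetric confidence interval and an optimistic reward-cost ratio as the arm index; the constant `z`, the clipping floor, and the function names are illustrative, not the paper's exact policy.

```python
import math

def asymmetric_bounds(mean: float, n: int, z: float = 2.0) -> tuple[float, float]:
    """Wilson-score-style interval for the mean of a [0, 1] random variable.
    The half-width shrinks as the sample mean nears either bound, making the
    interval asymmetric around the mean, unlike a symmetric Hoeffding box."""
    if n == 0:
        return 0.0, 1.0
    z2 = z * z
    center = (mean + z2 / (2 * n)) / (1 + z2 / n)
    half = (z / (1 + z2 / n)) * math.sqrt(mean * (1 - mean) / n + z2 / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

def omega_ucb_index(reward_mean: float, cost_mean: float, n: int) -> float:
    """Optimistic reward-cost ratio: upper reward bound over lower cost
    bound, clipped away from zero to keep the ratio finite."""
    _, reward_hi = asymmetric_bounds(reward_mean, n)
    cost_lo, _ = asymmetric_bounds(cost_mean, n)
    return reward_hi / max(cost_lo, 1e-9)
```

A budgeted-bandit loop would repeatedly pull the arm maximizing `omega_ucb_index` until the budget is exhausted.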
Related papers
- Best Arm Identification with Minimal Regret [55.831935724659175]
This work amalgamates regret minimization and best arm identification (BAI).
The agent's goal is to identify the best arm at a prescribed confidence level while incurring minimal regret.
The proposed Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
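The summary names Double KL-UCB without further detail. As background only, the classical KL-UCB index for Bernoulli rewards, which such algorithms build on, can be computed by bisection; this is the standard construction, not the paper's exact rule.

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean: float, n: int, t: int) -> float:
    """Largest q >= mean with n * KL(mean, q) <= log t, found by bisection."""
    target = math.log(max(t, 2)) / n
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```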
arXiv Detail & Related papers (2024-09-27T16:46:02Z) - Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget [6.22018632187078]
We introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget.
The goal is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs with a given learning budget $N$.
We provide a theoretical upper bound on the probability of mis-identification (of the support of the best mixed arm) and show that it decays exponentially in the budget $N$.
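Given point estimates of rewards and costs, the best mixed arm itself is the solution of a small linear program over arm probabilities. The sketch below shows only that projection step (the helper name and interface are hypothetical; the paper's CBMAI algorithm must additionally learn the estimates within the budget $N$).

```python
import numpy as np
from scipy.optimize import linprog

def best_mixed_arm(rewards, costs, budgets):
    """Distribution over K arms maximizing expected reward subject to
    expected-cost constraints: rewards has shape (K,), costs (M, K),
    budgets (M,). Solved as a linear program."""
    K = len(rewards)
    res = linprog(
        c=-np.asarray(rewards, dtype=float),  # linprog minimizes, so negate
        A_ub=np.asarray(costs, dtype=float),  # expected cost per constraint
        b_ub=np.asarray(budgets, dtype=float),
        A_eq=np.ones((1, K)),                 # probabilities sum to one
        b_eq=np.array([1.0]),
        bounds=[(0.0, 1.0)] * K,
    )
    return res.x if res.success else None
```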
arXiv Detail & Related papers (2024-05-23T22:35:11Z) - Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
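A simplified stand-in for that idea: sweep the surviving arms round-robin and, after each sweep, reject any arm whose empirical gap to the leader exceeds a confidence-style threshold. The UCB-type bonus below is illustrative, not the paper's rejection rule.

```python
import math

def gap_reject_bai(pull, K: int, budget: int) -> int:
    """Round-robin over surviving arms; after each sweep, reject arms whose
    empirical gap to the current leader exceeds an illustrative bonus."""
    assert budget >= K, "need at least one pull per arm"
    sums, counts = [0.0] * K, [0] * K
    active, pulls = set(range(K)), 0
    while pulls < budget and len(active) > 1:
        for a in list(active):
            if pulls >= budget:
                break
            sums[a] += pull(a)
            counts[a] += 1
            pulls += 1
        means = {a: sums[a] / counts[a] for a in active}
        leader = max(active, key=means.get)
        radius = {a: math.sqrt(2 * math.log(budget) / counts[a]) for a in active}
        active = {a for a in active
                  if a == leader
                  or means[leader] - means[a] <= radius[a] + radius[leader]}
    return max(active, key=lambda a: sums[a] / counts[a])
```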
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
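A simplified rendering of that pairing (a stand-in, not the paper's exact policy): one posterior sample elects a leader, resampling until a different arm wins elects a challenger, and the learner pulls one of the two at random.

```python
import random

def ts_best_challenger(successes, failures, max_tries: int = 100) -> int:
    """One Beta-Bernoulli Thompson step with a challenger: the leader wins a
    posterior sample, the challenger is the first different winner under
    resampling; pull either with equal probability."""
    K = len(successes)

    def posterior_argmax() -> int:
        draws = [random.betavariate(successes[a] + 1, failures[a] + 1)
                 for a in range(K)]
        return max(range(K), key=draws.__getitem__)

    leader = posterior_argmax()
    challenger = leader
    for _ in range(max_tries):
        challenger = posterior_argmax()
        if challenger != leader:
            break
    return leader if random.random() < 0.5 else challenger
```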
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Fixed-Budget Best-Arm Identification with Heterogeneous Reward Variances [12.00630538470713]
We study the problem of best-arm identification (BAI) in the fixed-budget setting with heterogeneous reward variances.
We propose two variance-adaptive BAI algorithms for this setting: SHVar for known reward variances and SHAdaVar for unknown reward variances.
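For the known-variance case, one illustrative reading (not necessarily the paper's exact allocation) is sequential halving where each round's budget is split across surviving arms in proportion to their reward variances:

```python
import math
import numpy as np

def sh_var(pull, variances, budget: int) -> int:
    """Sequential halving; each round's pulls go to surviving arms in
    proportion to their known reward variances."""
    K = len(variances)
    active = list(range(K))
    sums, counts = np.zeros(K), np.zeros(K)
    n_rounds = max(1, math.ceil(math.log2(K)))
    per_round = budget // n_rounds
    for _ in range(n_rounds):
        if len(active) == 1:
            break
        w = np.asarray([variances[a] for a in active], dtype=float) + 1e-12
        w /= w.sum()
        for a, m in zip(active, np.maximum(1, (w * per_round).astype(int))):
            for _ in range(int(m)):
                sums[a] += pull(a)
                counts[a] += 1
        means = sums[np.array(active)] / counts[np.array(active)]
        keep = np.argsort(means)[::-1][: max(1, len(active) // 2)]
        active = [active[i] for i in keep]
    return active[0]
```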
arXiv Detail & Related papers (2023-06-13T05:41:38Z) - Covariance Adaptive Best Arm Identification [0.0]
The goal is to identify the arm with the highest mean reward with probability at least $1 - \delta$, while minimizing the number of arm pulls.
We propose a more flexible scenario where arms can be dependent and rewards can be sampled simultaneously.
This framework is relevant in various applications, such as clinical trials, where similarities between patients or drugs suggest underlying correlations.
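When two dependent arms can be sampled simultaneously, the gap between their means can be estimated from paired differences, whose variance already absorbs the covariance between the arms; a small sketch of that effect (illustrative, not the paper's algorithm):

```python
import math
import numpy as np

def paired_gap_ci(x, y, z: float = 2.0) -> tuple[float, float]:
    """Confidence interval for E[X - Y] from simultaneous samples. Since
    var(X - Y) = var(X) + var(Y) - 2 cov(X, Y), positive correlation between
    the arms tightens the interval."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    half = z * d.std(ddof=1) / math.sqrt(len(d))
    return float(d.mean() - half), float(d.mean() + half)
```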
arXiv Detail & Related papers (2023-06-05T06:57:09Z) - BanditQ: Fair Bandits with Guaranteed Rewards [10.74025233418392]
Classic no-regret multi-armed bandit algorithms are inherently unfair by design.
We propose a new online policy, called BanditQ, that achieves the target reward rates while conceding a regret and target rate violation penalty.
The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem.
arXiv Detail & Related papers (2023-04-11T13:39:47Z) - Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with
Probabilistically Triggered Arms or Independent Arms [53.89752069984382]
We study combinatorial semi-bandits (CMAB) and focus on reducing the dependency on the batch size $K$ in the regret bound.
First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we propose a BCUCB-T algorithm with variance-aware confidence intervals.
Second, for the setting of non-triggering CMAB with independent arms, we propose a SESCB algorithm which leverages the non-triggering version of the TPVM condition.
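Variance-aware confidence intervals generally follow the empirical-Bernstein pattern, in which the exploration bonus scales with the observed variance rather than only the reward range; a generic radius of this form (not the paper's constants) is:

```python
import math

def bernstein_radius(emp_var: float, n: int, t: int, rng: float = 1.0) -> float:
    """Empirical-Bernstein-style confidence radius: the first term scales
    with the empirical variance, the second with the reward range."""
    log_t = math.log(max(t, 2))
    return math.sqrt(2.0 * emp_var * log_t / n) + 3.0 * rng * log_t / n
```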
arXiv Detail & Related papers (2022-08-31T13:09:39Z) - Best Arm Identification in Restless Markov Multi-Armed Bandits [85.55466536537293]
We study the problem of identifying the best arm in a multi-armed bandit environment.
A decision entity wishes to find the index of the best arm as quickly as possible, subject to an upper bound error probability.
We show that this policy achieves an upper bound that depends on $R$ and is monotonically non-increasing as $R \to \infty$.
arXiv Detail & Related papers (2022-03-29T04:58:04Z) - A Farewell to Arms: Sequential Reward Maximization on a Budget with a
Giving Up Option [5.1629297054995265]
We consider a sequential decision-making problem where an agent can take one action at a time and each action has a temporal extent.
We introduce an upper-confidence-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound.
arXiv Detail & Related papers (2020-03-06T22:16:20Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
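For discrete reward supports, a mode-based analogue of UCB can attach an exploration bonus to the empirical mode; the rule below is illustrative only, not the paper's algorithm:

```python
import math
from collections import Counter

def mode_ucb_arm(history, t: int) -> int:
    """Pick the arm maximizing empirical mode plus a UCB-style bonus;
    history[a] is the list of discrete rewards observed from arm a."""
    best_arm, best_index = 0, float("-inf")
    for arm, samples in enumerate(history):
        if not samples:
            return arm  # play every arm at least once
        mode = Counter(samples).most_common(1)[0][0]
        index = mode + math.sqrt(2.0 * math.log(max(t, 2)) / len(samples))
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm
```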
arXiv Detail & Related papers (2020-03-05T21:29:27Z)