A Farewell to Arms: Sequential Reward Maximization on a Budget with a
Giving Up Option
- URL: http://arxiv.org/abs/2003.03456v1
- Date: Fri, 6 Mar 2020 22:16:20 GMT
- Title: A Farewell to Arms: Sequential Reward Maximization on a Budget with a
Giving Up Option
- Authors: P Sharoff, Nishant A. Mehta, Ravi Ganti
- Abstract summary: We consider a sequential decision-making problem where an agent can take one action at a time and each action has a temporal extent.
We introduce an upper-confidence-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound.
- Score: 5.1629297054995265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a sequential decision-making problem where an agent can take one
action at a time and each action has a stochastic temporal extent, i.e., a new
action cannot be taken until the previous one is finished. Upon completion, the
chosen action yields a stochastic reward. The agent seeks to maximize its
cumulative reward over a finite time budget, with the option of "giving up" on
a current action -- hence forfeiting any reward -- in order to choose another
action. We cast this problem as a variant of the stochastic multi-armed bandit
problem with stochastic consumption of a resource. For this problem, we first
establish that the optimal arm is the one that maximizes the ratio of the
expected reward of the arm to the expected waiting time before the agent sees
the reward due to pulling that arm. Using a novel upper confidence bound on
this ratio, we then introduce an upper-confidence-based algorithm, WAIT-UCB,
for which we establish a logarithmic, problem-dependent regret bound that has
an improved dependence on problem parameters compared to previous works.
Simulations on various problem configurations comparing WAIT-UCB against
state-of-the-art algorithms are also presented.
Related papers
- Best Arm Identification with Minimal Regret [55.831935724659175]
The best arm identification problem elegantly amalgamates regret minimization and BAI.
The agent's goal is to identify the best arm with a prescribed confidence level.
The Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
arXiv Detail & Related papers (2024-09-27T16:46:02Z)
- Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
arXiv Detail & Related papers (2024-02-23T06:27:12Z)
- Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential decision-making framework in which, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a specific and possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z)
- Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals [0.9831489366502302]
We study the Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from $K$ arms with unknown expected rewards and costs.
We propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals.
These intervals scale with the distance between the sample mean and the bounds of a random variable, yielding a more accurate and tight estimation of the reward-cost ratio.
arXiv Detail & Related papers (2023-06-12T12:35:16Z)
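The entry above describes confidence intervals that tighten as the sample mean approaches the known bounds of the random variable. The snippet below only illustrates that general idea with a Wilson-score-style interval for a bounded variable and an optimistic reward-to-cost ratio built from it; it is not claimed to reproduce the exact $\omega$-UCB construction, and all names and parameters are hypothetical.

```python
# Illustrative only: an asymmetric confidence interval for a mean bounded in [lo, hi]
# (Wilson-score style), and an optimistic reward-to-cost ratio built from it.
# This sketches bound-aware asymmetric intervals in general, not the paper's formula.
import math

def wilson_interval(mean, n, lo, hi, z=2.0):
    """Asymmetric interval for a mean in [lo, hi]; tighter when the mean is near a bound."""
    if n == 0:
        return lo, hi
    p = (mean - lo) / (hi - lo)                  # rescale the mean to [0, 1]
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    lower = lo + (hi - lo) * max(center - half, 0.0)
    upper = lo + (hi - lo) * min(center + half, 1.0)
    return lower, upper

def optimistic_ratio(reward_mean, cost_mean, n, reward_bounds=(0, 1), cost_bounds=(0.01, 1)):
    """Optimistic reward/cost estimate: upper reward bound over lower cost bound."""
    _, r_up = wilson_interval(reward_mean, n, *reward_bounds)
    c_low, _ = wilson_interval(cost_mean, n, *cost_bounds)
    return r_up / max(c_low, 1e-9)

# An arm whose empirical reward mean sits near the upper bound gets a much less
# inflated index than one sitting in the middle of the range, for the same n.
print(optimistic_ratio(0.95, 0.5, n=50))
print(optimistic_ratio(0.50, 0.5, n=50))
```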
- Best Arm Identification for Stochastic Rising Bandits [84.55453174601826]
Stochastic Rising Bandits (SRBs) model sequential decision-making problems in which the expected reward of the available options increases every time they are selected.
This paper focuses on the fixed-budget Best Arm Identification (BAI) problem for SRBs.
We propose two algorithms to tackle the above-mentioned setting, namely R-UCBE and R-SR.
arXiv Detail & Related papers (2023-02-15T08:01:37Z)
- Mitigating Disparity while Maximizing Reward: Tight Anytime Guarantee for Improving Bandits [6.537940428724029]
We study the Improving Multi-Armed Bandit (IMAB) problem, where the reward obtained from an arm increases with the number of pulls it receives.
Our main contribution is an anytime algorithm for the IMAB problem that achieves the best possible cumulative reward.
arXiv Detail & Related papers (2022-08-19T10:23:40Z)
- Best Arm Identification in Restless Markov Multi-Armed Bandits [85.55466536537293]
We study the problem of identifying the best arm in a multi-armed bandit environment.
A decision entity wishes to find the index of the best arm as quickly as possible, subject to an upper bound error probability.
We show that this policy achieves an upper bound that depends on $R$ and is monotonically non-increasing as $R \to \infty$.
arXiv Detail & Related papers (2022-03-29T04:58:04Z)
- Break your Bandit Routine with LSD Rewards: a Last Switch Dependent Analysis of Satiation and Seasonality [6.146046338698175]
We introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions.
Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function.
We propose an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy.
arXiv Detail & Related papers (2021-10-22T14:53:13Z)
- Combinatorial Blocking Bandits with Stochastic Delays [33.65025386998747]
Recent work has considered natural variations of the multi-armed bandit problem, where the reward of each arm is a special function of the time passed since its last pulling.
In this work, we extend the above model in two directions: (i) We consider the general setting where more than one arm can be played at each round, subject to feasibility constraints.
We provide a tight analysis of the approximation achieved by a natural greedy heuristic that always plays the feasible subset of maximum expected reward among the available (non-blocked) arms.
When the arms' expected rewards are unknown, we adapt the above algorithm into a bandit algorithm, based on
arXiv Detail & Related papers (2021-05-22T02:46:04Z)
- Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z)
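As a concrete reading of the lenient criterion in the last entry, the sketch below compares standard regret with an $\epsilon$-lenient variant that only counts suboptimality gaps larger than $\epsilon$. The specific gap-counting form is an assumption for illustration and may differ from the exact definition used in that paper.

```python
# Minimal sketch: standard regret sums every suboptimality gap, while a "lenient"
# variant ignores gaps smaller than epsilon. Illustrative definition only.
import numpy as np

def regrets(means, chosen_arms, epsilon=0.05):
    means = np.asarray(means)
    gaps = means.max() - means[np.asarray(chosen_arms)]   # per-round suboptimality gaps
    standard = gaps.sum()
    lenient = gaps[gaps > epsilon].sum()                   # near-optimal pulls cost nothing
    return standard, lenient

# Example: pulling an arm whose mean is within epsilon of the best incurs
# standard regret but no lenient regret.
print(regrets(means=[0.50, 0.48, 0.10], chosen_arms=[1, 1, 2, 0], epsilon=0.05))
```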
This list is automatically generated from the titles and abstracts of the papers on this site.