A Farewell to Arms: Sequential Reward Maximization on a Budget with a
Giving Up Option
- URL: http://arxiv.org/abs/2003.03456v1
- Date: Fri, 6 Mar 2020 22:16:20 GMT
- Title: A Farewell to Arms: Sequential Reward Maximization on a Budget with a
Giving Up Option
- Authors: P Sharoff, Nishant A. Mehta, Ravi Ganti
- Abstract summary: We consider a sequential decision-making problem where an agent can take one action at a time and each action has a temporal extent.
We introduce an upper-confidence-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound.
- Score: 5.1629297054995265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a sequential decision-making problem where an agent can take one
action at a time and each action has a stochastic temporal extent, i.e., a new
action cannot be taken until the previous one is finished. Upon completion, the
chosen action yields a stochastic reward. The agent seeks to maximize its
cumulative reward over a finite time budget, with the option of "giving up" on
a current action -- hence forfeiting any reward -- in order to choose another
action. We cast this problem as a variant of the stochastic multi-armed bandit
problem with stochastic consumption of a resource. For this problem, we first
establish that the optimal arm is the one that maximizes the ratio of the
expected reward of the arm to the expected waiting time before the agent sees
the reward due to pulling that arm. Using a novel upper confidence bound on
this ratio, we then introduce an upper-confidence-based algorithm, WAIT-UCB,
for which we establish a logarithmic, problem-dependent regret bound that has
an improved dependence on problem parameters compared to previous works.
Simulations on various problem configurations comparing WAIT-UCB against
state-of-the-art algorithms are also presented.
Related papers
- Best Arm Identification with Minimal Regret [55.831935724659175]
The best arm identification problem elegantly amalgamates regret minimization and BAI.
The agent's goal is to identify the best arm with a prescribed confidence level.
The Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
arXiv Detail & Related papers (2024-09-27T16:46:02Z)
- Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
arXiv Detail & Related papers (2024-02-23T06:27:12Z)
- Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential decision-making framework in which, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a specific and possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z)
- Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals [0.9831489366502302]
We study the Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from $K$ arms with unknown expected rewards and costs.
We propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals.
These intervals scale with the distance between the sample mean and the bounds of a random variable, yielding a more accurate and tight estimation of the reward-cost ratio.
arXiv Detail & Related papers (2023-06-12T12:35:16Z)
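The entry above describes confidence intervals that tighten as the sample mean approaches the known bounds of the random variable. The snippet below only illustrates that general idea with a Wilson-score-style interval for a bounded variable and an optimistic reward-to-cost ratio built from it; it is not claimed to reproduce the exact $\omega$-UCB construction, and all names and parameters are hypothetical.

```python
# Illustrative only: an asymmetric confidence interval for a mean bounded in [lo, hi]
# (Wilson-score style), and an optimistic reward-to-cost ratio built from it.
# This sketches bound-aware asymmetric intervals in general, not the paper's formula.
import math

def wilson_interval(mean, n, lo, hi, z=2.0):
    """Asymmetric interval for a mean in [lo, hi]; tighter when the mean is near a bound."""
    if n == 0:
        return lo, hi
    p = (mean - lo) / (hi - lo)                  # rescale the mean to [0, 1]
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    lower = lo + (hi - lo) * max(center - half, 0.0)
    upper = lo + (hi - lo) * min(center + half, 1.0)
    return lower, upper

def optimistic_ratio(reward_mean, cost_mean, n, reward_bounds=(0, 1), cost_bounds=(0.01, 1)):
    """Optimistic reward/cost estimate: upper reward bound over lower cost bound."""
    _, r_up = wilson_interval(reward_mean, n, *reward_bounds)
    c_low, _ = wilson_interval(cost_mean, n, *cost_bounds)
    return r_up / max(c_low, 1e-9)

# An arm whose empirical reward mean sits near the upper bound gets a much less
# inflated index than one sitting in the middle of the range, for the same n.
print(optimistic_ratio(0.95, 0.5, n=50))
print(optimistic_ratio(0.50, 0.5, n=50))
```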
- Best Arm Identification for Stochastic Rising Bandits [84.55453174601826]
Stochastic Rising Bandits (SRBs) model sequential decision-making problems in which the expected reward of the available options increases every time they are selected.
This paper focuses on the fixed-budget Best Arm Identification (BAI) problem for SRBs.
We propose two algorithms to tackle the above-mentioned setting, namely R-UCBE and R-SR.
arXiv Detail & Related papers (2023-02-15T08:01:37Z)
- Mitigating Disparity while Maximizing Reward: Tight Anytime Guarantee for Improving Bandits [6.537940428724029]
We study the Improving Multi-Armed Bandit (IMAB) problem, where the reward obtained from an arm increases with the number of pulls it receives.
Our main contribution is an anytime algorithm for the IMAB problem that achieves the best possible cumulative reward.
arXiv Detail & Related papers (2022-08-19T10:23:40Z)
- Best Arm Identification in Restless Markov Multi-Armed Bandits [85.55466536537293]
We study the problem of identifying the best arm in a multi-armed bandit environment.
A decision entity wishes to find the index of the best arm as quickly as possible, subject to an upper bound error probability.
We show that this policy achieves an upper bound that depends on $R$ and is monotonically non-increasing as $R \to \infty$.
arXiv Detail & Related papers (2022-03-29T04:58:04Z)
- Break your Bandit Routine with LSD Rewards: a Last Switch Dependent Analysis of Satiation and Seasonality [6.146046338698175]
We introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions.
Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function.
We propose an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy.
arXiv Detail & Related papers (2021-10-22T14:53:13Z)
- Combinatorial Blocking Bandits with Stochastic Delays [33.65025386998747]
Recent work has considered natural variations of the multi-armed bandit problem, where the reward of each arm is a special function of the time passed since its last pulling.
In this work, we extend the above model in two directions: (i) We consider the general setting where more than one arm can be played at each round, subject to feasibility constraints.
We provide a tight analysis of the approximation achieved by a natural greedy heuristic that always plays the feasible subset of maximum expected reward among the available (non-blocked) arms.
When the arms' expected rewards are unknown, we adapt the above algorithm into a bandit algorithm, based on
arXiv Detail & Related papers (2021-05-22T02:46:04Z)
- Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z)
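As a concrete reading of the lenient criterion in the last entry, the sketch below compares standard regret with an $\epsilon$-lenient variant that only counts suboptimality gaps larger than $\epsilon$. The specific gap-counting form is an assumption for illustration and may differ from the exact definition used in that paper.

```python
# Minimal sketch: standard regret sums every suboptimality gap, while a "lenient"
# variant ignores gaps smaller than epsilon. Illustrative definition only.
import numpy as np

def regrets(means, chosen_arms, epsilon=0.05):
    means = np.asarray(means)
    gaps = means.max() - means[np.asarray(chosen_arms)]   # per-round suboptimality gaps
    standard = gaps.sum()
    lenient = gaps[gaps > epsilon].sum()                   # near-optimal pulls cost nothing
    return standard, lenient

# Example: pulling an arm whose mean is within epsilon of the best incurs
# standard regret but no lenient regret.
print(regrets(means=[0.50, 0.48, 0.10], chosen_arms=[1, 1, 2, 0], epsilon=0.05))
```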
This list is automatically generated from the titles and abstracts of the papers on this site.