Multi-Armed Bandits with Censored Consumption of Resources
- URL: http://arxiv.org/abs/2011.00813v1
- Date: Mon, 2 Nov 2020 08:27:38 GMT
- Title: Multi-Armed Bandits with Censored Consumption of Resources
- Authors: Viktor Bengs and Eyke Hüllermeier
- Abstract summary: We consider a resource-aware variant of the classical multi-armed bandit problem.
In each round, the learner selects an arm and determines a resource limit.
It then observes a corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit.
- Score: 9.803834317538747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a resource-aware variant of the classical multi-armed bandit
problem: In each round, the learner selects an arm and determines a resource
limit. It then observes a corresponding (random) reward, provided the (random)
amount of consumed resources remains below the limit. Otherwise, the
observation is censored, i.e., no reward is obtained. For this problem setting,
we introduce a measure of regret, which incorporates the actual amount of
allocated resources of each learning round as well as the optimality of
realizable rewards. Thus, to minimize regret, the learner needs to set a
resource limit and choose an arm in such a way that the chance to realize a
high reward within the predefined resource limit is high, while the resource
limit itself should be kept as low as possible. We derive the theoretical lower
bound on the cumulative regret and propose a learning algorithm having a regret
upper bound that matches the lower bound. In a simulation study, we show that
our learning algorithm outperforms straightforward extensions of standard
multi-armed bandit algorithms.
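As a rough illustration of the interaction protocol described in the abstract, the sketch below simulates one learner-environment round. The three-armed instance, the Bernoulli rewards, the exponential consumption times, and the λ-weighted per-round cost are all illustrative assumptions, not the paper's actual instance or its regret definition.
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed instance; the Bernoulli rewards and exponential
# consumption times are illustrative choices, not taken from the paper.
ARMS = [
    {"reward_mean": 0.9, "consumption_scale": 0.8},
    {"reward_mean": 0.6, "consumption_scale": 0.3},
    {"reward_mean": 0.4, "consumption_scale": 0.1},
]

def play(arm, limit):
    """One round of the censored protocol: the learner picks an arm and a
    resource limit; the reward is observed only if the realized consumption
    stays below the limit, otherwise the observation is censored."""
    consumption = rng.exponential(ARMS[arm]["consumption_scale"])
    reward = rng.binomial(1, ARMS[arm]["reward_mean"])
    if consumption <= limit:
        return reward, consumption, False      # reward realized
    return None, limit, True                   # censored: limit spent, no reward

def per_round_cost(realized_reward, allocated_limit, lam=0.5):
    """Hypothetical per-round objective combining the allocated resources
    with the realized reward (the weight `lam` and this exact form are
    assumptions; the paper's regret measure may combine them differently)."""
    return lam * allocated_limit - (realized_reward or 0.0)

reward, used, censored = play(arm=1, limit=0.5)
print(reward, used, censored, per_round_cost(reward, 0.5))
```
A learner minimizing such a cost must pick arms and limits so that rewards are likely to be realized within the limit while the limit itself stays small, which is exactly the trade-off the abstract describes.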
Related papers
- Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
arXiv Detail & Related papers (2024-02-23T06:27:12Z)
- Clustered Linear Contextual Bandits with Knapsacks [9.668078830796999]
We study clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models.
Pulling an arm in a time period results in a reward and in the consumption of each of multiple resources.
We show that it suffices to perform clustering only once, on a randomly selected subset of the arms.
arXiv Detail & Related papers (2023-08-21T13:47:13Z)
- Solving Multi-Arm Bandit Using a Few Bits of Communication [42.13277217013971]
The multi-armed bandit (MAB) problem is an active learning framework that aims to select the best among a set of actions by sequentially observing rewards.
We address the communication problem by optimizing the communication of rewards collected by distributed agents.
We establish a generic reward quantization algorithm, QuBan, that can be applied on top of any (no-regret) MAB algorithm to form a new communication-efficient counterpart.
arXiv Detail & Related papers (2021-11-11T06:23:16Z)
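The few-bits idea can be made concrete with a generic unbiased stochastic-rounding quantizer wrapped around any reward-based bandit update. This is only a sketch of that general technique; the function names and the fixed [0, 1] reward range are assumptions, and it is not claimed to reproduce the paper's QuBan scheme.
```python
import numpy as np

rng = np.random.default_rng(1)

def quantize_reward(r, bits=2, low=0.0, high=1.0):
    """Encode a reward in [low, high] with `bits` bits via stochastic
    rounding, so the decoded value is unbiased: E[decode(encode(r))] = r."""
    levels = 2 ** bits - 1
    scaled = (np.clip(r, low, high) - low) / (high - low) * levels
    floor = int(np.floor(scaled))
    # Round up with probability equal to the fractional part.
    code = floor + (rng.random() < scaled - floor)
    return min(int(code), levels)

def dequantize(code, bits=2, low=0.0, high=1.0):
    levels = 2 ** bits - 1
    return low + code / levels * (high - low)

# A distributed agent would transmit only `code` (2 bits here); the central
# learner feeds the dequantized value into its usual bandit update.
code = quantize_reward(0.37)
print(code, dequantize(code))
```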
- Combinatorial Multi-armed Bandits for Resource Allocation [23.869080705146395]
Motivating examples include allocating limited computing time or wireless spectrum bands to multiple users.
The decision maker should learn the value of the resources allocated to each user from feedback on each user's received reward.
arXiv Detail & Related papers (2021-05-10T13:55:30Z)
- Censored Semi-Bandits for Resource Allocation [12.450488894967773]
We consider the problem of sequentially allocating resources in a censored semi-bandits setup.
The loss depends on two hidden parameters: one is specific to the arm but independent of the resource allocation, while the other depends on the allocated resource.
We derive optimal algorithms for our problem setting using known algorithms for MP-MAB and Combinatorial Semi-Bandits.
arXiv Detail & Related papers (2021-04-12T19:15:32Z)
- Coordinated Online Learning for Multi-Agent Systems with Coupled Constraints and Perturbed Utility Observations [91.02019381927236]
We introduce a novel method to steer the agents toward a stable population state, fulfilling the given resource constraints.
The proposed method is a decentralized resource pricing method based on the resource loads resulting from the augmentation of the game's Lagrangian.
arXiv Detail & Related papers (2020-10-21T10:11:17Z)
- Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the reward of the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z)
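A minimal sketch of the distinction, assuming the hard-threshold reading of the summary above (suboptimality gaps of at most $\epsilon$ are forgiven entirely; the paper's criterion may be more general):
```python
def standard_regret(gaps):
    """Cumulative regret: sum of the suboptimality gaps of the chosen arms."""
    return sum(gaps)

def lenient_regret(gaps, eps):
    """Hard-threshold lenient regret: gaps of at most eps are not penalized."""
    return sum(g for g in gaps if g > eps)

# Pulls of a near-optimal arm (gap 0.01) do not count under the lenient
# criterion with eps = 0.05, but still add to the standard regret.
gaps = [0.0, 0.01, 0.3, 0.01, 0.0]
print(standard_regret(gaps), lenient_regret(gaps, eps=0.05))
```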
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Constrained regret minimization for multi-criterion multi-armed bandits [5.349852254138086]
We study the problem of regret minimization over a given time horizon, subject to a risk constraint.
We propose a Risk-Constrained Lower Confidence Bound algorithm that guarantees logarithmic regret.
We prove lower bounds on the performance of any risk-constrained regret minimization algorithm.
arXiv Detail & Related papers (2020-06-17T04:23:18Z)
- The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity [83.81297078039836]
We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents.
We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility.
arXiv Detail & Related papers (2020-02-03T04:58:51Z)
- Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.