Adapting Bandit Algorithms for Settings with Sequentially Available Arms
- URL: http://arxiv.org/abs/2109.15228v1
- Date: Thu, 30 Sep 2021 15:56:37 GMT
- Title: Adapting Bandit Algorithms for Settings with Sequentially Available Arms
- Authors: Marco Gabrielli, Francesco Trovò, Manuela Antonelli
- Abstract summary: We propose a meta-algorithm, namely Sequential Pull/No-pull for MAB (Seq), to adapt any classical MAB policy to better suit this setting.
The proposed meta-algorithm gathers more information, especially in the first rounds, which are characterized by high uncertainty in the arms' estimated values.
The Seq meta-algorithm was extensively tested and compared with classical MAB policies on synthetic and real-world datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although the classical version of the Multi-Armed Bandits (MAB) framework has
been applied successfully to several practical problems, in many real-world
applications, the possible actions are not presented to the learner
simultaneously, such as in the Internet campaign management and environmental
monitoring settings. Instead, in such applications, a set of options is
presented sequentially to the learner within a time span, and this process is
repeated throughout a time horizon. At each time, the learner is asked whether
to select the proposed option or not. We define this scenario as the Sequential
Pull/No-pull Bandit setting, and we propose a meta-algorithm, namely Sequential
Pull/No-pull for MAB (Seq), to adapt any classical MAB policy to better suit
this setting for both the regret minimization and best-arm identification
problems. By allowing the selection of multiple arms within a round, the
proposed meta-algorithm gathers more information, especially in the first
rounds, which are characterized by high uncertainty in the arms' estimated values. At the
same time, the adapted algorithms provide the same theoretical guarantees as
the classical policy employed. The Seq meta-algorithm was extensively tested
and compared with classical MAB policies on synthetic and real-world datasets
from advertising and environmental monitoring applications, highlighting its
good empirical performance.
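To make the Sequential Pull/No-pull interaction concrete, below is a minimal Python sketch of a pull/no-pull wrapper around UCB1-style indices. It only illustrates the interface of the setting: arms are proposed one at a time within a round and the learner may pull several of them. The pull rule and all parameters are illustrative assumptions, not the paper's actual Seq meta-algorithm.

```python
import math
import random

class SeqUCBSketch:
    """Illustrative pull/no-pull wrapper around UCB1-style indices."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms     # number of pulls per arm
        self.means = [0.0] * n_arms    # empirical mean reward per arm
        self.t = 0                     # total pulls so far

    def index(self, arm):
        # UCB1 index; unpulled arms get +inf so they are explored first.
        if self.counts[arm] == 0:
            return math.inf
        bonus = math.sqrt(2.0 * math.log(max(self.t, 2)) / self.counts[arm])
        return self.means[arm] + bonus

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.t += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def run_round(policy, arm_order, reward_fn):
    """One round: arms are proposed sequentially; answer pull/no-pull for each."""
    best_seen = -math.inf
    pulled = []
    for arm in arm_order:
        if policy.index(arm) >= best_seen:           # toy pull rule (assumption)
            policy.update(arm, reward_fn(arm))
            pulled.append(arm)
        best_seen = max(best_seen, policy.index(arm))
    return pulled


# Toy usage: three Bernoulli arms proposed in a fixed order each round.
probs = [0.2, 0.5, 0.8]
policy = SeqUCBSketch(n_arms=3)
for _ in range(200):
    run_round(policy, [0, 1, 2],
              lambda a: 1.0 if random.random() < probs[a] else 0.0)
```

Because unexplored arms receive an infinite index, every arm gets pulled in the earliest rounds, which loosely mirrors the extra information gathering described in the abstract.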
Related papers
- Contextual Bandits with Arm Request Costs and Delays [19.263086804406786]
We introduce a novel extension of the contextual bandit problem, where new sets of arms can be requested with time delays and associated costs.
In this setting, the learner can select multiple arms from a decision set, with each selection taking one unit of time.
We design algorithms that can effectively select arms and determine the appropriate time to request new arms, thereby minimizing their regret.
arXiv Detail & Related papers (2024-10-17T00:44:50Z) - An incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting [53.36437745983783]
We first construct a max-margin optimization-based model to represent potentially non-monotonic preferences.
We devise information amount measurement methods and question selection strategies to pinpoint the most informative alternative in each iteration.
Two incremental preference elicitation-based algorithms are developed to learn potentially non-monotonic preferences.
arXiv Detail & Related papers (2024-09-04T14:36:20Z) - Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit [0.0]
The Multi-Armed Bandit (MAB) framework on the one hand and the policy gradient approach on the other are among the most widely used frameworks in Reinforcement Learning.
We investigate in this work the convergence of such a procedure when an $L2$ regularization term is present jointly with the 'softmax' parametrization.
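For intuition only, the following is a generic REINFORCE-style update with an added L2 penalty on the softmax parameters; the step size, penalty weight, and Bernoulli rewards are assumptions made here for illustration and are not taken from the paper.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def pg_step(theta, arm_means, alpha=0.1, lam=0.01):
    """One REINFORCE step with an L2 penalty on the softmax parameters."""
    probs = softmax(theta)
    arm = random.choices(range(len(theta)), weights=probs)[0]
    reward = 1.0 if random.random() < arm_means[arm] else 0.0   # Bernoulli reward
    for i in range(len(theta)):
        grad_log = (1.0 if i == arm else 0.0) - probs[i]        # d log pi / d theta_i
        theta[i] += alpha * (reward * grad_log - 2.0 * lam * theta[i])

theta = [0.0, 0.0, 0.0]
for _ in range(2000):
    pg_step(theta, arm_means=[0.2, 0.5, 0.8])
```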
arXiv Detail & Related papers (2024-02-09T13:10:04Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
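A minimal sketch of the mediator interaction described above, assuming Bernoulli arms and mediators that randomize over arms according to fixed, known distributions; the round-robin choice of mediator is a placeholder, not the strategy proposed in the paper.

```python
import random

def play_mediator(mediator_policy, arm_means):
    """The selected mediator draws an arm from its own distribution; the learner
    then observes which arm was pulled and the resulting reward."""
    arm = random.choices(range(len(arm_means)), weights=mediator_policy)[0]
    reward = 1.0 if random.random() < arm_means[arm] else 0.0
    return arm, reward

arm_means = [0.2, 0.5, 0.8]
mediators = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]    # known policies over arms
for t in range(100):
    m = t % len(mediators)                        # placeholder mediator choice
    arm, reward = play_mediator(mediators[m], arm_means)
```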
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - Online Continuous Hyperparameter Optimization for Generalized Linear Contextual Bandits [55.03293214439741]
In contextual bandits, an agent sequentially selects actions from a time-dependent action set based on past experience.
We propose the first online continuous hyperparameter tuning framework for contextual bandits.
We show that it achieves sublinear regret in theory and consistently performs better than all existing methods on both synthetic and real datasets.
arXiv Detail & Related papers (2023-02-18T23:31:20Z) - Generative Slate Recommendation with Reinforcement Learning [49.75985313698214]
Reinforcement learning algorithms can be used to optimize user engagement in recommender systems.
However, RL approaches are intractable in the slate recommendation scenario.
In that setting, an action corresponds to a slate that may contain any combination of items.
In this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder.
We are able to (i) relax assumptions required by previous work, and (ii) improve the quality of the action selection by modeling full slates.
arXiv Detail & Related papers (2023-01-20T15:28:09Z) - Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions dubbed R(MA)2B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
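To illustrate the R(MA)2B structure rather than the paper's Occupancy-Measured-Reward Index Policy, here is a toy two-state restless arm: its state evolves at every step whether or not the arm is activated, and the reward depends on both the state and the action. All transition probabilities, rewards, and the scheduling rule are made-up values.

```python
import random

class RestlessArm:
    """Toy two-state restless arm: the state moves at every step, whether or not
    the arm is activated, and the reward depends on (state, action)."""

    # TRANSITIONS[action][state] = probability of moving to the "good" state (1)
    TRANSITIONS = {0: [0.1, 0.7], 1: [0.4, 0.9]}   # action 0 = passive, 1 = active
    REWARDS = {0: [0.0, 0.1], 1: [0.2, 1.0]}       # REWARDS[action][state]

    def __init__(self):
        self.state = 0

    def step(self, action):
        reward = self.REWARDS[action][self.state]
        p_good = self.TRANSITIONS[action][self.state]
        self.state = 1 if random.random() < p_good else 0
        return reward

arms = [RestlessArm() for _ in range(4)]
total = 0.0
for t in range(50):
    chosen = t % len(arms)                          # placeholder scheduling rule
    for i, arm in enumerate(arms):
        total += arm.step(1 if i == chosen else 0)
```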
arXiv Detail & Related papers (2021-09-20T21:40:12Z) - Max-Utility Based Arm Selection Strategy For Sequential Query Recommendations [16.986870945319293]
We consider the query recommendation problem in closed loop interactive learning settings like online information gathering and exploratory analytics.
The problem can be naturally modelled using the Multi-Armed Bandits (MAB) framework with countably many arms.
We show that such a selection strategy often results in higher cumulative regret; to address this, we propose a selection strategy based on the maximum utility of the arms.
arXiv Detail & Related papers (2021-08-31T13:03:30Z) - Bayesian decision-making under misspecified priors with applications to meta-learning [64.38020203019013]
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
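As a reminder of the baseline under study, here is a standard Beta-Bernoulli Thompson sampling loop; the arm means and prior parameters are placeholders, and choosing a prior far from the truth mimics the misspecification scenario analyzed in the paper.

```python
import random

def thompson_sampling(arm_means, horizon=1000, prior_a=1.0, prior_b=1.0):
    """Beta-Bernoulli Thompson sampling; (prior_a, prior_b) is the common prior
    for every arm and may be misspecified."""
    n = len(arm_means)
    a = [prior_a] * n
    b = [prior_b] * n
    total = 0.0
    for _ in range(horizon):
        samples = [random.betavariate(a[i], b[i]) for i in range(n)]
        arm = samples.index(max(samples))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        a[arm] += reward
        b[arm] += 1.0 - reward
        total += reward
    return total

# Uniform prior vs. a prior that wrongly believes every arm is excellent.
well_specified = thompson_sampling([0.2, 0.5, 0.8])
misspecified = thompson_sampling([0.2, 0.5, 0.8], prior_a=50.0, prior_b=1.0)
```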
arXiv Detail & Related papers (2021-07-03T23:17:26Z) - Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback [51.21673420940346]
Combinatorial bandits generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set.
We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set.
Based on a projection-free online learning algorithm for finite polytopes, the proposed method is the first computationally efficient algorithm that is asymptotically optimal and has competitive empirical performance.
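The semi-bandit feedback model can be shown in a few lines: the agent plays a set of arms and observes one noisy reward per arm in that set. The Gaussian noise and the random set choice below are illustrative assumptions, not the paper's pure-exploration algorithm.

```python
import random

def semi_bandit_round(chosen_set, arm_means, noise=0.1):
    """Play a set of arms; observe one noisy reward for each arm in the set."""
    feedback = {arm: arm_means[arm] + random.gauss(0.0, noise)
                for arm in chosen_set}
    return feedback, sum(feedback.values())

arm_means = [0.1, 0.4, 0.3, 0.9, 0.6]
chosen = random.sample(range(len(arm_means)), k=2)   # placeholder set choice
per_arm_rewards, total_reward = semi_bandit_round(chosen, arm_means)
```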
arXiv Detail & Related papers (2021-01-21T10:35:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.