Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When
Partial Feedback Counts
- URL: http://arxiv.org/abs/2206.00586v1
- Date: Wed, 1 Jun 2022 15:56:59 GMT
- Title: Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When
Partial Feedback Counts
- Authors: Giulia Romano, Andrea Agostini, Francesco Trovò, Nicola Gatti,
Marcello Restelli
- Abstract summary: We study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
- Score: 53.579515853222986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a rising interest in industrial online applications where data
becomes available sequentially. Inspired by the recommendation of playlists to
users where their preferences can be collected during the listening of the
entire playlist, we study a novel bandit setting, namely Multi-Armed Bandit
with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward
associated with the pull of an arm is partitioned over a finite number of
consecutive rounds following the pull. This setting, unexplored so far to the
best of our knowledge, is a natural extension of delayed-feedback bandits to
the case in which rewards may be dilated over a finite-time span after the pull
instead of being fully disclosed in a single, potentially delayed round. We
provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and
TP-UCB-EW, which exploit the partial information disclosed by the reward
collected over time. We show that our algorithms provide better asymptotic
regret upper bounds than delayed-feedback bandit algorithms when a property
characterizing a broad set of reward structures of practical interest, namely
alpha-smoothness, holds. We also empirically evaluate their performance across
a wide range of settings, both synthetically generated and from a real-world
media recommendation problem.
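To make the setting concrete, the following is a minimal Python sketch of a TP-UCB-style optimistic index, assuming each pull's total reward (at most R_MAX) arrives in TAU equal per-round portions; portions not yet seen are completed optimistically at the per-round cap that an alpha-smoothness-type bound would provide. The constants, the uniform split, and the exact index form are illustrative assumptions, not the paper's TP-UCB-FR.

```python
import math

R_MAX, TAU = 1.0, 5  # assumed constants: max total reward, reward span

def optimistic_index(pull_records, t, sigma=1.0):
    """Sketch of a TP-UCB-style index for one arm (not the exact TP-UCB-FR).

    pull_records: list of (portions_seen, collected_sum), one entry per past
    pull of this arm; a pull reveals up to TAU per-round portions over time.
    """
    n = len(pull_records)
    if n == 0:
        return float("inf")  # force an initial pull of every arm
    # Complete each pull optimistically: unseen portions are set to the
    # per-round cap R_MAX / TAU (the kind of bound alpha-smoothness yields),
    # then a standard UCB exploration bonus is added.
    optimistic_total = sum(
        collected + (TAU - seen) * (R_MAX / TAU)
        for seen, collected in pull_records
    )
    return optimistic_total / n + sigma * math.sqrt(2 * math.log(t) / n)
```

The arm with the largest index is pulled each round; as portions arrive, the optimistic completion shrinks toward the empirical mean, which is exactly the partial information a delayed-feedback baseline would discard.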
Related papers
- Biased Dueling Bandits with Stochastic Delayed Feedback [6.167074802065416]
We present two algorithms designed to handle delayed feedback.
Our first algorithm, which requires complete knowledge of the delay distribution, matches the optimal regret bound of the delay-free dueling bandit problem.
The second algorithm is tailored to the case where the delay distribution is unknown and only its expected value is available.
arXiv Detail & Related papers (2024-08-26T19:49:12Z)
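Neither algorithm is reproduced here, but the delayed-feedback mechanics are easy to sketch: duel outcomes are queued and revealed only after their delay elapses. `pick_pair` and `delay_fn` are caller-supplied placeholders, not objects from the paper.

```python
import random

def delayed_duel_loop(pref, pick_pair, delay_fn, horizon):
    """Generic dueling-bandit loop with stochastic delays (a toy sketch).

    pref[i][j] is the probability that arm i beats arm j in a duel.
    """
    pending, history = [], []  # not-yet-revealed / revealed duel outcomes
    for t in range(horizon):
        # release every outcome whose delay has elapsed
        history += [x[1:] for x in pending if x[0] <= t]
        pending = [x for x in pending if x[0] > t]
        i, j = pick_pair(history)             # learner sees only `history`
        i_won = random.random() < pref[i][j]  # stochastic duel outcome
        pending.append((t + delay_fn(), i, j, i_won))
    return history
```

For instance, delayed_duel_loop([[0.5, 0.7], [0.3, 0.5]], lambda h: (0, 1), lambda: random.randint(1, 5), 100) runs 100 duels whose outcomes arrive one to five rounds late.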
- Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
arXiv Detail & Related papers (2024-02-23T06:27:12Z)
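A toy rendering of the abstention option, under two assumptions made here for illustration: abstaining yields a fixed known payoff, and the reward is still revealed for learning even when not accepted. The accept rule below (accept only when the arm's lower confidence bound clears the abstention payoff) is one plausible choice, not the paper's.

```python
import math, random

def ucb_with_abstention(means, abstain_value, horizon, seed=0):
    """Toy UCB learner that may reject the instantaneous reward unseen."""
    rng = random.Random(seed)
    k = len(means)
    n, s, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, horizon + 1):
        # standard optimistic arm choice; unexplored arms go first
        idx = [s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i])
               if n[i] else float("inf") for i in range(k)]
        a = idx.index(max(idx))
        # accept the (still unseen) reward only if the arm's lower
        # confidence bound beats the guaranteed abstention payoff
        lcb = (s[a] / n[a] - math.sqrt(2 * math.log(t) / n[a])
               if n[a] else -float("inf"))
        r = rng.gauss(means[a], 0.1)
        total += r if lcb >= abstain_value else abstain_value
        n[a] += 1; s[a] += r  # assumed: reward observed either way
    return total
```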
- An Optimal Algorithm for the Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit [65.268245109828]
We study the real-valued combinatorial pure exploration of the multi-armed bandit problem (R-CPE-MAB).
Existing methods in the R-CPE-MAB can be seen as a special case of the so-called transductive linear bandits.
We propose a gap-based exploration algorithm, CombGapE, whose sample complexity matches the lower bound.
arXiv Detail & Related papers (2023-06-15T15:37:31Z)
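For rough intuition about gap-based exploration, here is a standard LUCB-flavoured sampling rule for plain best-arm identification; CombGapE itself operates on real-valued combinatorial action sets, which this sketch does not attempt.

```python
import math

def gap_based_next_arm(means_hat, counts, t, delta=0.05):
    """Pick the next arm to sample, gap-based best-arm-ID style
    (a simplified sketch, not the paper's CombGapE)."""
    k = len(means_hat)
    if 0 in counts:
        return counts.index(0)  # sample every arm once first
    rad = [math.sqrt(math.log(k * t / delta) / (2 * counts[i]))
           for i in range(k)]
    best = max(range(k), key=lambda i: means_hat[i])
    # challenger: the arm whose optimistic value most threatens `best`
    challenger = max((i for i in range(k) if i != best),
                     key=lambda i: means_hat[i] + rad[i])
    # sample whichever of the two is currently less certain
    return best if rad[best] >= rad[challenger] else challenger
```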
- Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards [0.4194295877935867]
In some real-world applications, feedback about a decision is delayed and may arrive via partial rewards that are observed with different delays.
We propose a novel problem formulation called multi-armed bandits with generalized temporally-partitioned rewards.
We derive a lower bound on the performance of any uniformly efficient algorithm for the considered problem.
arXiv Detail & Related papers (2023-03-01T16:22:22Z)
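A minimal sketch of the generalized feedback model: one pull emits several reward components, each observed after its own delay. The uniform split across components and the Gaussian noise are assumptions for illustration only.

```python
import random

def pull_with_partitioned_feedback(mean, delays, noise=0.05):
    """Toy generalized temporally-partitioned feedback: one reward
    component per entry of `delays`, each observed that many rounds
    after the pull (illustrative model, not the paper's formulation)."""
    k = len(delays)
    components = [max(0.0, random.gauss(mean / k, noise)) for _ in delays]
    return sorted(zip(delays, components))  # (delay, partial reward) pairs
```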
- Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
arXiv Detail & Related papers (2022-10-13T04:26:06Z)
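Stripped of the sketching step, the imputation idea reduces to fitting a ridge regression on executed (context, reward) pairs and predicting rewards for the non-executed actions; the names and the plain (unsketched) solve below are illustrative.

```python
import numpy as np

def impute_unobserved_rewards(X_exec, r_exec, X_unexec, lam=1.0):
    """Ridge-regression imputation of non-executed actions' rewards
    (a stripped-down sketch of the SPUIR idea; sketching omitted)."""
    d = X_exec.shape[1]
    # fit theta on executed (context, reward) pairs: standard ridge solution
    A = X_exec.T @ X_exec + lam * np.eye(d)
    theta = np.linalg.solve(A, X_exec.T @ r_exec)
    # predict the rewards the learner never observed
    return X_unexec @ theta
```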
- Break your Bandit Routine with LSD Rewards: a Last Switch Dependent Analysis of Satiation and Seasonality [6.146046338698175]
We introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions.
Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function.
We propose an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy.
arXiv Detail & Related papers (2021-10-22T14:53:13Z)
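Two toy instances of a last-switch-dependent expected reward, parameterized by the number of rounds since the arm last took part in a switch; the functional forms below are illustrative, while the paper's class is far more general.

```python
def lsd_expected_reward(base, tau, mode="satiation"):
    """Toy last-switch-dependent reward: `tau` counts the rounds since the
    arm last took part in a switch of actions (illustrative forms only)."""
    if mode == "satiation":
        return base / (1 + tau)          # repeated play wears the arm out
    return base * (1 - 2.0 ** (-tau))    # seasonality: rest restores value
```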
- Batched Neural Bandits [107.5072688105936]
BatchNeuralUCB combines neural networks with optimism to address the exploration-exploitation tradeoff.
We prove that BatchNeuralUCB achieves the same regret as the fully sequential version while reducing the number of policy updates considerably.
arXiv Detail & Related papers (2021-02-25T17:36:44Z)
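The batching idea in miniature, with a plain UCB index standing in for the neural estimator: rewards are buffered and the policy (the current greedy arm) is recomputed only when a batch fills, so the number of policy updates is roughly horizon / batch_size rather than horizon.

```python
import math, random

def batched_ucb(means, horizon, batch_size, seed=0):
    """Toy batched UCB (a simplified, non-neural stand-in for the
    BatchNeuralUCB idea: indices recomputed only at batch boundaries)."""
    rng = random.Random(seed)
    k = len(means)
    n, s = [0] * k, [0.0] * k
    buffer, updates, arm = [], 0, 0  # play arm 0 until the first update
    for t in range(1, horizon + 1):
        buffer.append((arm, rng.gauss(means[arm], 0.1)))
        if len(buffer) == batch_size:  # policy update point
            for a, r in buffer:
                n[a] += 1; s[a] += r
            buffer, updates = [], updates + 1
            idx = [s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i])
                   if n[i] else float("inf") for i in range(k)]
            arm = idx.index(max(idx))
    return updates  # roughly horizon / batch_size policy updates
```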
- Confidence-Budget Matching for Sequential Budgeted Learning [69.77435313099366]
We formalize decision-making problems with a querying budget.
We consider multi-armed bandits, linear bandits, and reinforcement learning problems.
We show that CBM-based algorithms perform well in the presence of adversity.
arXiv Detail & Related papers (2021-02-05T19:56:31Z)
- Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback [32.62857394584907]
We study the multi-armed bandit (MAB) problem with composite and anonymous feedback.
We propose adaptive algorithms for both the adversarial and non-adversarial cases.
arXiv Detail & Related papers (2020-12-13T12:25:41Z)
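The composite, anonymous feedback model is easy to state in code: each past pull spreads its reward over a window of rounds, and at each round the learner observes only the aggregate of all pieces landing then, with no attribution to the pulls that generated them. The window length `spread` and the uniform split are assumptions for illustration.

```python
def composite_anonymous_round(pull_log, t, spread=3):
    """Observation at round t under composite anonymous feedback (a toy
    model): pull_log holds (pull_round, total_reward) pairs."""
    observed = 0.0
    for pull_round, reward in pull_log:
        if pull_round < t <= pull_round + spread:  # a piece lands now
            observed += reward / spread
    return observed  # a single anonymous scalar: no per-arm attribution
```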
This list is automatically generated from the titles and abstracts of the papers in this site.