Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback
- URL: http://arxiv.org/abs/2012.07048v2
- Date: Tue, 15 Dec 2020 11:44:06 GMT
- Title: Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback
- Authors: Siwei Wang, Haoyun Wang, Longbo Huang
- Abstract summary: We study the multi-armed bandit (MAB) problem with composite and anonymous feedback.
We propose adaptive algorithms for both the stochastic and the adversarial cases.
- Score: 32.62857394584907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the multi-armed bandit (MAB) problem with composite and anonymous
feedback. In this model, the reward of pulling an arm spreads over a period of
time (we call this period the reward interval), and the player successively
receives partial rewards of the action, convolved with the rewards from pulling
other arms. Existing results on this model require prior knowledge about the
reward interval size as an input to their algorithms. In this paper, we propose
adaptive algorithms for both the stochastic and the adversarial cases, without
requiring any prior information about the reward interval. For the stochastic
case, we prove that our algorithm guarantees a regret that matches the lower
bound in order. For the adversarial case, we propose the first algorithm that
jointly handles a non-oblivious adversary and an unknown reward interval size.
We also conduct simulations based on a real-world dataset. The results show
that our algorithms outperform existing benchmarks.
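To make the feedback model concrete, below is a minimal simulation sketch (not the paper's algorithm): each pull generates a reward that is spread over a hypothetical reward interval of size d, and the observation in each round is the anonymous sum of the partial rewards that all past pulls scheduled for that round. The arm means, the interval size d, the uniform spreading rule, and the random policy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                                # number of arms (illustrative)
T = 20                               # horizon (illustrative)
d = 4                                # hypothetical reward interval size; unknown to the learner in the paper
means = np.array([0.3, 0.5, 0.8])    # assumed Bernoulli arm means

pending = np.zeros(T + d)            # partial rewards already scheduled for future rounds

for t in range(T):
    arm = rng.integers(K)            # placeholder policy: pull a uniformly random arm
    reward = rng.binomial(1, means[arm])
    # Composite feedback: the reward of this pull is spread over the next d rounds
    # (uniformly here, purely for illustration).
    pending[t:t + d] += reward / d
    # Anonymous feedback: the learner only sees the aggregate partial reward of round t,
    # mixed across all earlier pulls, with no indication of which arm produced what.
    observation = pending[t]
    print(f"round {t}: pulled arm {arm}, observed {observation:.2f}")
```

The adaptive algorithms in the paper operate without being told d; the sketch only illustrates the structure of the observations the learner receives.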
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Replicability is Asymptotically Free in Multi-armed Bandits [45.729105054410745]
This work is motivated by the growing demand for reproducible machine learning.
In particular, we consider a replicable algorithm that ensures, with high probability, that the algorithm's sequence of actions is not affected by the randomness inherent in the dataset.
arXiv Detail & Related papers (2024-02-12T03:31:34Z)
- Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely the Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
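For contrast with the anonymous model above, here is a small illustrative sketch of temporally-partitioned rewards, assuming (as the name of the setting suggests) that the partial rewards of a pull remain attributable to that pull; the uniform partitioning and the parameters are assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
span = 5                                  # assumed finite time span over which a reward is dilated
total = rng.uniform(0, 1)                 # total reward of one pull (illustrative)
partials = np.full(span, total / span)    # partition the reward over the span (uniform, for illustration)

# In TP-MAB the learner observes each partial reward and knows which pull produced it,
# so it can be credited to the right arm; in the composite-and-anonymous model above,
# only the round-wise sum over all pulls is visible.
for delay, r in enumerate(partials, start=1):
    print(f"{delay} round(s) after the pull: observe partial reward {r:.3f} attributed to this pull")
```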
arXiv Detail & Related papers (2022-06-01T15:56:59Z)
- Gaussian Process Bandits with Aggregated Feedback [8.667190358712062]
We consider the continuum-armed bandits problem, under a novel setting of recommending the best arms within a fixed budget under aggregated feedback.
This is motivated by applications where the precise rewards are impossible or expensive to obtain, while an aggregated reward or feedback, such as the average over a subset, is available.
We propose a new simple regret notion with respect to aggregated feedback on the recommended arms.
arXiv Detail & Related papers (2021-12-24T11:03:00Z)
- Risk-Aware Algorithms for Combinatorial Semi-Bandits [7.716156977428555]
We study the multi-armed bandit problem under semi-bandit feedback.
We consider the problem of maximizing the Conditional Value-at-Risk (CVaR), a risk measure that takes into account only the worst-case rewards.
We propose new algorithms that maximize the CVaR of the rewards obtained from the super arms of the bandit.
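As a quick reference for the risk measure mentioned here, the snippet below computes an empirical CVaR at level alpha as the average of the worst alpha-fraction of observed rewards; the sample data and the level are arbitrary, and this is only the textbook definition, not the paper's estimator or algorithm.

```python
import numpy as np

def empirical_cvar(rewards, alpha=0.1):
    """Average of the worst alpha-fraction of the rewards (lower tail)."""
    rewards = np.sort(np.asarray(rewards, dtype=float))   # ascending: worst rewards first
    k = max(1, int(np.ceil(alpha * len(rewards))))
    return rewards[:k].mean()

samples = np.random.default_rng(2).normal(loc=1.0, scale=0.5, size=1000)
print(empirical_cvar(samples, alpha=0.05))   # averages only the worst 5% of outcomes
```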
arXiv Detail & Related papers (2021-12-02T11:29:43Z)
- Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback [51.21673420940346]
Combinatorial bandits generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set.
We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set.
Our algorithm, based on a projection-free online learning algorithm for finite polytopes, is the first computationally efficient algorithm that is asymptotically optimal and has competitive empirical performance.
arXiv Detail & Related papers (2021-01-21T10:35:09Z)
- Resource Allocation in Multi-armed Bandit Exploration: Overcoming Sublinear Scaling with Adaptive Parallelism [107.48538091418412]
We study exploration in multi-armed bandits when we have access to a divisible resource that can be allocated in varying amounts to arm pulls.
We focus in particular on the allocation of distributed computing resources, where we may obtain results faster by allocating more resources per pull.
arXiv Detail & Related papers (2020-10-31T18:19:29Z)
- Corralling Stochastic Bandit Algorithms [54.10645564702416]
We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward.
We also show that the regret scales with the gap between the highest reward and the rewards of the other arms.
arXiv Detail & Related papers (2020-06-16T15:33:12Z)
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)
- Blocking Bandits [33.14975454724348]
We consider a novel multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter.
We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm.
We design a UCB based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm.
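As a rough illustration of the blocking constraint, the sketch below applies a standard UCB index only to the arms that are currently available and marks a pulled arm as blocked for its delay; the means, the delays, and the assumption that the delays are known are illustrative, and this is not the paper's exact algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
means = np.array([0.9, 0.6, 0.4])    # illustrative Bernoulli arm means
delays = np.array([3, 2, 1])         # after a pull, arm i is unavailable for delays[i] rounds (assumed known)
T = 50

counts = np.zeros(len(means))
sums = np.zeros(len(means))
available_at = np.zeros(len(means), dtype=int)   # earliest round at which each arm is available again

for t in range(T):
    avail = np.flatnonzero(available_at <= t)
    if avail.size == 0:
        continue                                  # every arm is blocked this round
    # Standard UCB index, restricted to the currently available arms.
    n = np.maximum(counts[avail], 1)
    ucb = np.where(counts[avail] == 0, np.inf,
                   sums[avail] / n + np.sqrt(2 * np.log(t + 1) / n))
    arm = avail[np.argmax(ucb)]
    reward = rng.binomial(1, means[arm])
    counts[arm] += 1
    sums[arm] += reward
    available_at[arm] = t + 1 + delays[arm]       # blocking: the arm cannot be pulled for delays[arm] rounds
```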
arXiv Detail & Related papers (2019-07-27T20:42:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.