Output-Weighted Sampling for Multi-Armed Bandits with Extreme Payoffs
- URL: http://arxiv.org/abs/2102.10085v1
- Date: Fri, 19 Feb 2021 18:36:03 GMT
- Title: Output-Weighted Sampling for Multi-Armed Bandits with Extreme Payoffs
- Authors: Yibo Yang, Antoine Blanchard, Themistoklis Sapsis, Paris Perdikaris
- Abstract summary: We present a new class of acquisition functions for online decision making in bandit problems with extreme payoffs.
We formulate a novel type of upper confidence bound (UCB) acquisition function that guides exploration towards the bandits that are deemed most relevant.
- Score: 11.1546439770774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new class of acquisition functions for online decision making in multi-armed and contextual bandit problems with extreme payoffs. Specifically, we model the payoff function as a Gaussian process and formulate a novel type of upper confidence bound (UCB) acquisition function that guides exploration towards the bandits deemed most relevant according to the variability of the observed rewards. This is achieved by computing a tractable likelihood ratio that quantifies the importance of the output relative to the inputs and essentially acts as an attention mechanism that promotes exploration of extreme rewards. We demonstrate the benefits of the proposed methodology across several synthetic benchmarks, as well as a realistic example involving noisy sensor network data. Finally, we provide a JAX library for efficient bandit optimization using Gaussian processes.
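The abstract does not spell out the acquisition rule, but its description admits a compact sketch: a UCB whose exploration term is weighted by a likelihood ratio over outputs. The snippet below is a minimal illustration under stated assumptions, not the paper's JAX library: the GP posterior mean mu and standard deviation sigma are assumed given, the output density is approximated with a simple Gaussian KDE, and the weight w(x) = p_x(x) / p_mu(mu(x)) plays the role of the attention mechanism, up-weighting arms whose predicted payoffs are rare, i.e., extreme.

```python
import jax.numpy as jnp
from jax import random

def kde_pdf(samples, x):
    # Fixed-bandwidth Gaussian KDE (Silverman's rule of thumb).
    n = samples.shape[0]
    h = 1.06 * jnp.std(samples) * n ** (-0.2)
    z = (x[:, None] - samples[None, :]) / h
    return jnp.mean(jnp.exp(-0.5 * z ** 2), axis=1) / (h * jnp.sqrt(2.0 * jnp.pi))

def lw_ucb(mu, sigma, p_x, kappa=1.0):
    # Likelihood-weighted UCB: mu + kappa * w * sigma, where the weight
    # w(x) = p_x(x) / p_mu(mu(x)) is large when the predicted payoff mu(x)
    # is rare under the output density, steering exploration to extremes.
    p_mu = kde_pdf(mu, mu)             # density of the posterior-mean values
    w = p_x / (p_mu + 1e-12)           # likelihood-ratio "attention" weight
    return mu + kappa * w * sigma

# Toy usage on a 1-D grid of candidate arms with a uniform input density.
xs = jnp.linspace(0.0, 1.0, 101)
mu = jnp.sin(6.0 * xs)                 # stand-in GP posterior mean
sigma = 0.2 * jnp.ones_like(xs)        # stand-in GP posterior std
p_x = jnp.ones_like(xs)                # uniform input density
next_arm = int(jnp.argmax(lw_ucb(mu, sigma, p_x, kappa=2.0)))
```

The exact normalization and the KDE estimate of the output density are assumptions here; the paper's library may implement the ratio differently.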
Related papers
- Batch Ensemble for Variance Dependent Regret in Stochastic Bandits [41.95653110232677]
Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL).
Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that achieves near-optimal regret for Multi-Armed Bandits (MAB).
Our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses.
arXiv Detail & Related papers (2024-09-13T06:40:56Z)
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution (a preference-loss sketch follows this entry).
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
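The neural reward estimation from preference feedback described above can be illustrated with a Bradley-Terry style loss. The sketch below is a stand-in under assumptions (a tiny MLP and a logistic preference likelihood), not the paper's algorithm, which additionally maintains confidence bounds for arm selection.

```python
import jax
import jax.numpy as jnp
from jax import random

def init_mlp(key, sizes):
    # Small MLP whose scalar output is the estimated reward r(x).
    keys = random.split(key, len(sizes) - 1)
    return [(random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, (m, n) in zip(keys, zip(sizes[:-1], sizes[1:]))]

def reward(params, x):
    h = x
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)

def preference_loss(params, x_win, x_lose):
    # Bradley-Terry model: P(x_win preferred over x_lose)
    #   = sigmoid(r(x_win) - r(x_lose)); minimize the negative log-likelihood.
    margin = reward(params, x_win) - reward(params, x_lose)
    return -jnp.mean(jax.nn.log_sigmoid(margin))

# Toy usage: one gradient step on synthetic preference pairs.
params = init_mlp(random.PRNGKey(0), [4, 32, 1])
x_win = random.normal(random.PRNGKey(1), (16, 4))
x_lose = random.normal(random.PRNGKey(2), (16, 4))
grads = jax.grad(preference_loss)(params, x_win, x_lose)
params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
```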
- TS-RSR: A provably efficient approach for batch Bayesian Optimization [4.622871908358325]
This paper presents a new approach for batch Bayesian Optimization (BO) called Thompson Sampling-Regret to Sigma Ratio directed sampling (TS-RSR).
Our sampling objective is able to coordinate the actions chosen in each batch in a way that minimizes redundancy between points.
We demonstrate that our method attains state-of-the-art performance on a range of challenging synthetic and realistic test functions (a simplified ratio score is sketched after this entry).
arXiv Detail & Related papers (2024-03-07T18:58:26Z)
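TS-RSR's full objective coordinates whole batches; as a rough single-point illustration of a regret-to-uncertainty ratio, one can score each candidate by predicted regret divided by posterior standard deviation and pick the minimizer. The snippet below is only that simplification, with mu, sigma, and f_best assumed given; it omits the Thompson sampling and batch-coordination machinery of the actual method.

```python
import jax.numpy as jnp

def rsr_score(mu, sigma, f_best, eps=1e-12):
    # Regret-to-sigma ratio (simplified): small when a point either looks
    # near-optimal (small predicted regret) or is highly uncertain.
    return (f_best - mu) / (sigma + eps)

mu = jnp.array([0.10, 0.70, 0.65])     # stand-in posterior means
sigma = jnp.array([0.05, 0.02, 0.30])  # stand-in posterior stds
next_point = int(jnp.argmin(rsr_score(mu, sigma, f_best=mu.max())))
```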
- Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of the various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z)
- Stochastic Rising Bandits [40.32303434592863]
We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing.
This characteristic allows the design of specially crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds.
We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
arXiv Detail & Related papers (2022-12-07T17:30:45Z)
- Thompson Sampling with Virtual Helping Agents [0.0]
We address the problem of online sequential decision making, i.e., balancing the trade-off between exploiting current knowledge to maximize immediate performance and exploring new information to gain long-term benefits.
We propose two algorithms for the multi-armed bandit problem and provide theoretical bounds on the cumulative regret (a baseline Thompson Sampling sketch follows this entry).
arXiv Detail & Related papers (2022-09-16T23:34:44Z)
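For context on the trade-off described above, here is a minimal Beta-Bernoulli Thompson Sampling loop: the classical baseline, not the paper's virtual-helping-agents variant. Each round, a mean is sampled from every arm's Beta posterior, the arm with the highest sample is pulled, and its posterior is updated with the observed binary reward.

```python
import jax.numpy as jnp
from jax import random

def run_thompson(key, true_means, horizon=1000):
    n = true_means.shape[0]
    alpha = jnp.ones(n)   # Beta posterior: 1 + observed successes per arm
    beta = jnp.ones(n)    # Beta posterior: 1 + observed failures per arm
    total = 0
    for _ in range(horizon):
        key, k_sample, k_reward = random.split(key, 3)
        theta = random.beta(k_sample, alpha, beta)   # one draw per arm
        arm = int(jnp.argmax(theta))                 # act on sampled beliefs
        r = int(random.bernoulli(k_reward, true_means[arm]))
        alpha = alpha.at[arm].add(r)
        beta = beta.at[arm].add(1 - r)
        total += r
    return total

# Toy usage: three Bernoulli arms; the 0.8 arm should dominate pulls.
print(run_thompson(random.PRNGKey(0), jnp.array([0.2, 0.5, 0.8])))
```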
- Thompson Sampling on Asymmetric $\alpha$-Stable Bandits [0.0]
The multi-armed bandit framework accommodates different applications by changing the assumed reward distribution; here, rewards follow asymmetric $\alpha$-stable distributions.
Thompson Sampling is a common method for solving the multi-armed bandit problem.
arXiv Detail & Related papers (2022-03-19T01:55:08Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem in the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
arXiv Detail & Related papers (2021-05-25T10:08:41Z)
- Learning from eXtreme Bandit Feedback [105.0383130431503]
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces.
In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime.
We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks (an illustrative importance-sampling sketch follows this entry).
arXiv Detail & Related papers (2020-09-27T20:47:25Z)
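The bias-variance tension that sIS addresses shows up already in the standard inverse-propensity-scoring (IPS) estimator and its weight-clipped variant, sketched below. This is generic off-policy evaluation, not the paper's sIS: pi and mu are target and logging propensities for the logged actions, and the clip level M trades bias for variance.

```python
import jax.numpy as jnp

def ips_value(rewards, pi, mu):
    # Unbiased IPS estimate of the target policy's value; the importance
    # weights pi/mu can explode in huge action spaces, inflating variance.
    return jnp.mean((pi / mu) * rewards)

def clipped_ips_value(rewards, pi, mu, M=10.0):
    # Clipping the weights at M bounds the variance at the cost of bias;
    # sIS navigates this trade-off differently, via selective sampling.
    return jnp.mean(jnp.minimum(pi / mu, M) * rewards)

# Toy usage with logged propensities for the actions actually taken.
rewards = jnp.array([1.0, 0.0, 1.0, 1.0])
pi = jnp.array([0.30, 0.10, 0.50, 0.05])   # target policy propensities
mu = jnp.array([0.25, 0.40, 0.02, 0.10])   # logging policy propensities
print(ips_value(rewards, pi, mu), clipped_ips_value(rewards, pi, mu))
```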
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values of interest are the modes of the reward distributions instead of the means.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)