Gaussian Process Bandits with Aggregated Feedback
- URL: http://arxiv.org/abs/2112.13029v1
- Date: Fri, 24 Dec 2021 11:03:00 GMT
- Title: Gaussian Process Bandits with Aggregated Feedback
- Authors: Mengyan Zhang, Russell Tsuchida, Cheng Soon Ong
- Abstract summary: We consider the continuum-armed bandits problem, under a novel setting of recommending the best arms within a fixed budget under aggregated feedback.
This is motivated by applications where the precise rewards are impossible or expensive to obtain, while an aggregated reward or feedback, such as the average over a subset, is available.
We propose a new simple regret notion with respect to aggregated feedback on the recommended arms.
- Score: 8.667190358712062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the continuum-armed bandits problem, under a novel setting of
recommending the best arms within a fixed budget under aggregated feedback.
This is motivated by applications where the precise rewards are impossible or
expensive to obtain, while an aggregated reward or feedback, such as the
average over a subset, is available. We constrain the set of reward functions
by assuming that they are from a Gaussian Process and propose the Gaussian
Process Optimistic Optimisation (GPOO) algorithm. We adaptively construct a
tree with nodes as subsets of the arm space, where the feedback is the
aggregated reward of representatives of a node. We propose a new simple regret
notion with respect to aggregated feedback on the recommended arms. We provide
theoretical analysis for the proposed algorithm, and recover single point
feedback as a special case. We illustrate GPOO and compare it with related
algorithms on simulated data.
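The abstract describes GPOO only at a high level: a tree is grown over the arm space, each node covers a subset of arms, and the learner observes only an aggregated (for example, averaged) reward over representative points of the node it queries. Below is a minimal runnable sketch of that idea on the interval [0, 1]. The RBF kernel and its lengthscale, the UCB-plus-width node score, the binary splitting rule, and all names (gp_posterior, aggregated_feedback, run_gpoo_sketch) are illustrative assumptions for this sketch, not the authors' exact GPOO algorithm or its recommendation rule.

```python
# Minimal sketch of optimistic optimisation with aggregated (subset-averaged)
# feedback, in the spirit of the GPOO setting described above.
# ASSUMPTIONS (not from the paper): kernel hyper-parameters, UCB-plus-width
# node score, binary splitting, and all helper names are illustrative only.
import numpy as np

def rbf_kernel(a, b, lengthscale=0.1):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=0.05):
    """Posterior mean and std of a zero-mean GP with RBF kernel."""
    K = rbf_kernel(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_query, x_train)
    mean = K_s @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def aggregated_feedback(f, node, n_reps=4, noise=0.05, rng=None):
    """Noisy *average* reward over representative arms inside a node."""
    if rng is None:
        rng = np.random.default_rng()
    lo, hi = node
    reps = np.linspace(lo, hi, n_reps + 2)[1:-1]   # interior representative arms
    return rng.normal(f(reps).mean(), noise)       # only the aggregate is observed

def run_gpoo_sketch(f, budget=20, beta=2.0, seed=0):
    """Adaptively grow a binary tree over [0, 1], querying aggregated feedback per node."""
    rng = np.random.default_rng(seed)
    leaves = [(0.0, 1.0)]
    xs, ys = [], []                                # queried node centres and aggregated rewards
    for _ in range(budget):
        centres = np.array([(lo + hi) / 2 for lo, hi in leaves])
        if xs:
            mu, sd = gp_posterior(np.array(xs), np.array(ys), centres)
        else:
            mu, sd = np.zeros(len(centres)), np.ones(len(centres))
        widths = np.array([hi - lo for lo, hi in leaves])
        # Optimistic score: posterior UCB plus a width bonus that favours large, unexplored cells.
        best = int(np.argmax(mu + beta * sd + widths))
        lo, hi = leaves.pop(best)
        xs.append((lo + hi) / 2)
        ys.append(aggregated_feedback(f, (lo, hi), rng=rng))
        mid = (lo + hi) / 2
        leaves += [(lo, mid), (mid, hi)]           # split the expanded node into two children
    # Recommend the leaf whose centre has the highest posterior mean.
    centres = np.array([(lo + hi) / 2 for lo, hi in leaves])
    mu, _ = gp_posterior(np.array(xs), np.array(ys), centres)
    return leaves[int(np.argmax(mu))]

if __name__ == "__main__":
    f = lambda x: np.exp(-(x - 0.7) ** 2 / 0.01)   # unknown reward with a peak at 0.7
    print("recommended cell:", run_gpoo_sketch(f))
```

Setting n_reps=1 in this sketch reduces the aggregated observation to a single representative point, mirroring how the paper recovers single-point feedback as a special case.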
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Combinatorial Bandits for Maximum Value Reward Function under Max Value-Index Feedback [9.771002043127728]
We consider a multi-armed bandit problem for maximum value reward function under maximum value and index feedback.
We propose an algorithm and provide a regret bound for problem instances with arm outcomes according to arbitrary distributions with finite supports.
Our algorithm achieves a $O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$ distribution-independent regret.
arXiv Detail & Related papers (2023-05-25T14:02:12Z) - Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits.
We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
arXiv Detail & Related papers (2022-06-22T17:58:23Z) - Finding Optimal Arms in Non-stochastic Combinatorial Bandits with Semi-bandit Feedback and Finite Budget [6.759124697337311]
We consider the bandits problem with semi-bandit feedback under finite sampling budget constraints.
The action is to choose a set of arms, whereupon feedback for each arm in the chosen set is received.
We suggest a generic algorithm suitable to cover the full spectrum of conceivable arm elimination strategies.
arXiv Detail & Related papers (2022-02-09T14:36:05Z) - Risk-Aware Algorithms for Combinatorial Semi-Bandits [7.716156977428555]
We study the multi-armed bandit problem under semi-bandit feedback.
We consider the problem of maximizing the Conditional Value-at-Risk (CVaR), a risk measure that takes into account only the worst-case rewards.
We propose new algorithms that maximize the CVaR of the rewards obtained from the super arms of the bandit.
arXiv Detail & Related papers (2021-12-02T11:29:43Z) - Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
arXiv Detail & Related papers (2021-05-25T10:08:41Z) - Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback [32.62857394584907]
We study the multi-armed bandit (MAB) problem with composite and anonymous feedback.
We propose adaptive algorithms for both the adversarial and non-adversarial cases.
arXiv Detail & Related papers (2020-12-13T12:25:41Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or summaries (including all information) and is not responsible for any consequences arising from their use.