Bias-Robust Bayesian Optimization via Dueling Bandit
- URL: http://arxiv.org/abs/2105.11802v1
- Date: Tue, 25 May 2021 10:08:41 GMT
- Title: Bias-Robust Bayesian Optimization via Dueling Bandit
- Authors: Johannes Kirschner and Andreas Krause
- Abstract summary: We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
- Score: 57.82422045437126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider Bayesian optimization in settings where observations can be
adversarially biased, for example by an uncontrolled hidden confounder. Our
first contribution is a reduction of the confounded setting to the dueling
bandit model. Then we propose a novel approach for dueling bandits based on
information-directed sampling (IDS). Thereby, we obtain the first efficient
kernelized algorithm for dueling bandits that comes with cumulative regret
guarantees. Our analysis further generalizes a previously proposed
semi-parametric linear bandit model to non-linear reward functions, and
uncovers interesting links to doubly-robust estimation.
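To make the information-ratio idea behind IDS concrete in a dueling (preference-feedback) setting, here is a minimal sketch: at each round it scores every pair of candidate points by (expected regret)^2 / (information gain) and plays the minimizing pair. Everything in the sketch is an assumption made for illustration, not the kernelized IDS algorithm analyzed in the paper: a finite candidate grid, an RBF-kernel Gaussian prior over utilities, a Gaussian approximation that treats the binary duel outcome as a noisy observation of the utility gap f(x) - f(x'), a sample-based regret proxy, and the posterior variance of the gap as the information proxy.

```python
# Illustrative IDS-style dueling bandit on a finite candidate set (NumPy only).
# All modeling choices below are simplifying assumptions for this sketch.
import numpy as np

rng = np.random.default_rng(0)

# Candidate points and an RBF-kernel prior over their unknown utilities f.
X = np.linspace(0.0, 1.0, 25).reshape(-1, 1)
n = len(X)

def rbf(a, b, ls=0.2):
    d = a[:, None, 0] - b[None, :, 0]
    return np.exp(-0.5 * (d / ls) ** 2)

K = rbf(X, X) + 1e-6 * np.eye(n)   # prior covariance over f
mu = np.zeros(n)                    # posterior mean over f
Sigma = K.copy()                    # posterior covariance over f
noise = 0.5                         # assumed noise of the Gaussianized duel signal

f_true = rng.multivariate_normal(np.zeros(n), K)  # hidden utility function

def duel(i, j):
    """Binary preference feedback: P(i beats j) = sigmoid(f(i) - f(j))."""
    p = 1.0 / (1.0 + np.exp(-(f_true[i] - f_true[j])))
    return float(rng.random() < p)

for t in range(100):
    # Regret proxy: posterior samples of each arm's gap to the sampled optimum.
    samples = rng.multivariate_normal(mu, Sigma + 1e-9 * np.eye(n), size=64)
    best = samples.max(axis=1)
    gap = (best[:, None] - samples).mean(axis=0)       # per-arm expected gap
    regret = 0.5 * (gap[:, None] + gap[None, :])       # pair regret = average gap

    # Information proxy: posterior variance of the utility difference f(i) - f(j).
    var = np.diag(Sigma)
    info = var[:, None] + var[None, :] - 2.0 * Sigma + 1e-12

    # IDS rule: pick the pair minimizing squared regret over information gain.
    ratio = regret ** 2 / info
    np.fill_diagonal(ratio, np.inf)                    # never duel an arm with itself
    i, j = np.unravel_index(np.argmin(ratio), ratio.shape)

    # Observe the duel; treat (outcome - 0.5) as a noisy observation of f(i) - f(j)
    # and apply a closed-form Gaussian (rank-one) posterior update.
    y = duel(i, j) - 0.5
    a = np.zeros(n)
    a[i], a[j] = 1.0, -1.0
    s = a @ Sigma @ a + noise
    gain = Sigma @ a / s
    mu = mu + gain * (y - a @ mu)
    Sigma = Sigma - np.outer(gain, a @ Sigma)

print("estimated best arm:", int(np.argmax(mu)), "| true best arm:", int(np.argmax(f_true)))
```

The ratio captures the IDS trade-off: a pair is only worth playing if its expected suboptimality is small relative to how much the duel outcome is expected to reveal about the unknown utilities.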
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits [21.09844002135398]
We show that Linear Thompson Sampling (LinTS) and the extension of the Bayesian Upper Confidence Bound (LinBUCB) preserve their original regret upper bound rates.
We also show that LinBUCB expedites the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate.
arXiv Detail & Related papers (2024-06-20T07:45:38Z)
- Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $\tilde{\mathcal{O}}(d\sqrt{T})$, where $d$ is the model dimension and $T$ is the time horizon.
arXiv Detail & Related papers (2024-04-09T04:45:18Z)
- Incentivizing Exploration with Linear Contexts and Combinatorial Actions [9.15749739027059]
In incentivized bandit exploration, arm choices are viewed as recommendations and are required to be Bayesian incentive compatible.
Recent work has shown under certain independence assumptions that after collecting enough initial samples, the popular Thompson sampling algorithm becomes incentive compatible.
We give an analog of this result for linear bandits, where the independence of the prior is replaced by a natural convexity condition.
arXiv Detail & Related papers (2023-06-03T03:30:42Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Langevin Monte Carlo for Contextual Bandits [72.00524614312002]
Langevin Monte Carlo Thompson Sampling (LMC-TS) is proposed to directly sample from the posterior distribution in contextual bandits.
We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits.
arXiv Detail & Related papers (2022-06-22T17:58:23Z)
- Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences [28.79598714109439]
We study the $K$-armed dueling bandit problem in both stochastic and adversarial environments.
We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits.
Our algorithm is also the first to achieve an optimal $O(\sum_{i=1}^{K} \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark.
arXiv Detail & Related papers (2022-02-14T13:37:23Z)
- Uncertainty-Aware Abstractive Summarization [3.1423034006764965]
We propose a novel approach to summarization based on Bayesian deep learning.
We show that our variational equivalents of BART and PEGASUS can outperform their deterministic counterparts on multiple benchmark datasets.
With a reliable uncertainty measure, we can improve the end-user experience by filtering out generated summaries with high uncertainty.
arXiv Detail & Related papers (2021-05-21T06:36:40Z)
- Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
- Bandit algorithms to emulate human decision making using probabilistic distortions [20.422725678982726]
We formulate two sample multi-armed bandit problems with distorted probabilities on the reward distributions.
We consider the aforementioned problems in the regret minimization as well as best arm identification framework for multi-armed bandits.
arXiv Detail & Related papers (2016-11-30T17:37:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.