Versatile Dueling Bandits: Best-of-both-World Analyses for Online
Learning from Preferences
- URL: http://arxiv.org/abs/2202.06694v1
- Date: Mon, 14 Feb 2022 13:37:23 GMT
- Title: Versatile Dueling Bandits: Best-of-both-World Analyses for Online
Learning from Preferences
- Authors: Aadirupa Saha and Pierre Gaillard
- Abstract summary: We study the problem of $K$-armed dueling bandit for both and adversarial environments.
We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits.
Our algorithm is also the first to achieve an optimal $O(\sum_{i=1}^K \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark.
- Score: 28.79598714109439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of $K$-armed dueling bandits for both stochastic
and adversarial environments, where the goal of the learner is to aggregate
information through relative preferences over pairs of decision points queried
in an online sequential manner. We first propose a novel reduction from any
(general) dueling bandit problem to multi-armed bandits which, despite its
simplicity, allows us to improve many existing results in dueling bandits. In particular,
\emph{we give the first best-of-both-worlds result for the dueling bandit
regret minimization problem} -- a unified framework that is guaranteed to
perform optimally for both stochastic and adversarial preferences
simultaneously. Moreover, our algorithm is also the first to achieve an optimal
$O(\sum_{i = 1}^K \frac{\log T}{\Delta_i})$ regret bound against the
Condorcet-winner benchmark, which scales optimally both in terms of the
arm-size $K$ and the instance-specific suboptimality gaps $\{\Delta_i\}_{i =
1}^K$. This resolves the long-standing problem of designing an instance-wise
gap-dependent order-optimal regret algorithm for dueling bandits (with matching
lower bounds up to small constant factors). We further justify the robustness
of our proposed algorithm by proving its optimal regret rate under
adversarially corrupted preferences -- this outperforms the existing
state-of-the-art corrupted dueling results by a large margin. In summary, we
believe our reduction idea will find broader scope in solving a diverse class
of dueling bandit settings, which are otherwise studied separately from
multi-armed bandits, often with more complex solutions and worse guarantees.
The efficacy of our proposed algorithms is empirically corroborated against
existing dueling bandit methods.
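To make the reduction concrete, below is a minimal, hypothetical Python sketch
(not the authors' exact algorithm): a single adversarially robust multi-armed
bandit learner maintains a distribution over arms, both arms of each duel are
drawn i.i.d. from that distribution, and the binary preference outcome is fed
back as an importance-weighted loss. EXP3 is used purely for concreteness; the
best-of-both-worlds guarantee would call for a base learner such as Tsallis-INF.

```python
import numpy as np

def dueling_via_mab(K, T, duel, eta=None):
    """Hypothetical sketch of a dueling-to-MAB reduction.

    `duel(i, j)` returns 1 if arm i wins the comparison, else 0.
    Both duel arms are drawn i.i.d. from the learner's distribution,
    and each receives an importance-weighted loss estimate.
    """
    eta = eta if eta is not None else np.sqrt(np.log(K) / (T * K))
    log_w = np.zeros(K)  # log-weights for numerical stability
    for _ in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        i, j = np.random.choice(K, size=2, replace=True, p=p)
        o = duel(i, j)                      # 1 if arm i beat arm j
        log_w[i] -= eta * (1.0 - o) / p[i]  # loss for i: it lost the duel
        log_w[j] -= eta * o / p[j]          # symmetric loss for j
    return int(np.argmax(log_w))

# Example: stochastic preferences with a Condorcet winner (arm 0).
P = np.array([[0.5, 0.7, 0.8],
              [0.3, 0.5, 0.6],
              [0.2, 0.4, 0.5]])  # P[i, j] = Pr(arm i beats arm j)
best = dueling_via_mab(K=3, T=5000,
                       duel=lambda i, j: int(np.random.rand() < P[i, j]))
```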
Related papers
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems where no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized explore-then-commit approach (see the sketch below) and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
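For intuition, here is a toy explore-then-commit sketch for a plain $K$-armed
bandit (hypothetical; SGB itself selects sets of base arms and optimizes the
exploration budget, which this toy version does not):

```python
import numpy as np

def explore_then_commit(K, T, pull, m):
    """Toy explore-then-commit: pull each arm m times, then commit to
    the empirically best arm for the remaining T - K*m rounds."""
    means = np.array([np.mean([pull(a) for _ in range(m)]) for a in range(K)])
    best = int(np.argmax(means))
    total = sum(pull(best) for _ in range(T - K * m))
    return best, total
```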
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^T \sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde O(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- Dueling Bandits: From Two-dueling to Multi-dueling [40.4378338001229]
We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret due to selecting suboptimal arms.
This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options.
arXiv Detail & Related papers (2022-11-16T22:00:54Z)
- Dual Instrumental Method for Confounded Kernelized Bandits [0.0]
The contextual bandit problem is a framework with wide applications in various fields.
We propose a confounded bandit problem where the noise becomes a latent confounder that affects both contexts and rewards.
We show that a dual instrumental variable regression can correctly identify the true reward function.
arXiv Detail & Related papers (2022-09-07T15:25:57Z)
- Max-Min Grouped Bandits [48.62520520818357]
We introduce a multi-armed bandit problem termed max-min grouped bandits.
The goal is to find a group whose worst arm has the highest mean reward (see the toy example below).
This problem is of interest to applications such as recommendation systems.
arXiv Detail & Related papers (2021-11-17T01:59:15Z)
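The max-min objective itself is easy to state; a toy illustration with known
means (hypothetical, since the actual problem must estimate the means from
bandit feedback):

```python
# Pick the group whose *worst* arm has the highest mean reward.
group_means = {"g1": [0.9, 0.2], "g2": [0.6, 0.5], "g3": [0.7, 0.4]}
best_group = max(group_means, key=lambda g: min(group_means[g]))
print(best_group)  # "g2": its worst arm (0.5) beats g1's (0.2) and g3's (0.4)
```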
- Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
arXiv Detail & Related papers (2021-05-25T10:08:41Z)
- Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback [51.21673420940346]
Combinatorial bandits generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set.
We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set.
Based on a projection-free online learning algorithm for finite polytopes, it is the first computationally efficient algorithm which is asymptotically optimal and has competitive empirical performance.
arXiv Detail & Related papers (2021-01-21T10:35:09Z)
- Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem (see the sketch below).
arXiv Detail & Related papers (2020-12-24T05:36:29Z)
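A minimal sketch of that meta-bandit idea, using vanilla UCB1 at the top level
(hypothetical; the paper's actual meta-UCB procedure includes corrections
beyond plain UCB1):

```python
import numpy as np

def meta_ucb(base_algs, T):
    """Treat N base bandit algorithms as arms of a top-level UCB1.
    `base_algs[k]()` plays one round of base algorithm k and returns
    the reward it obtained (assumed to lie in [0, 1])."""
    N = len(base_algs)
    counts, sums = np.zeros(N), np.zeros(N)
    for t in range(1, T + 1):
        if t <= N:
            k = t - 1  # play each base algorithm once
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            k = int(np.argmax(ucb))
        counts[k] += 1
        sums[k] += base_algs[k]()
    return sums.sum()
```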
- Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the problem of $K$-armed dueling bandits in the contextual setting.
We present two algorithms for the setup with respective regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)
- Bandit algorithms to emulate human decision making using probabilistic distortions [20.422725678982726]
We formulate two stochastic multi-armed bandit problems with distorted probabilities on the reward distributions.
We consider these problems in both the regret-minimization and the best-arm-identification frameworks for multi-armed bandits.
arXiv Detail & Related papers (2016-11-30T17:37:51Z)