Dueling Bandits: From Two-dueling to Multi-dueling
- URL: http://arxiv.org/abs/2211.10293v1
- Date: Wed, 16 Nov 2022 22:00:54 GMT
- Title: Dueling Bandits: From Two-dueling to Multi-dueling
- Authors: Yihan Du, Siwei Wang, Longbo Huang
- Abstract summary: We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret due to selecting suboptimal arms.
This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options.
- Score: 40.4378338001229
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a general multi-dueling bandit problem, where an agent compares
multiple options simultaneously and aims to minimize the regret due to
selecting suboptimal arms. This setting generalizes the traditional two-dueling
bandit problem and finds many real-world applications involving subjective
feedback on multiple options. We start with the two-dueling bandit setting and
propose two efficient algorithms, DoublerBAI and MultiSBM-Feedback. DoublerBAI
provides a generic schema for translating known results on best arm
identification algorithms to the dueling bandit problem, and achieves a regret
bound of $O(\ln T)$. MultiSBM-Feedback not only has an optimal $O(\ln T)$
regret, but also reduces the constant factor by almost a half compared to
benchmark results. Then, we consider the general multi-dueling case and develop
an efficient algorithm MultiRUCB. Using a novel finite-time regret analysis for
the general multi-dueling bandit problem, we show that MultiRUCB also achieves
an $O(\ln T)$ regret bound and the bound tightens as the capacity of the
comparison set increases. Based on both synthetic and real-world datasets, we
empirically demonstrate that our algorithms outperform existing algorithms.
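The abstract describes MultiRUCB only at a high level. As a rough illustration, here is a minimal Python sketch of an RUCB-style optimistic rule extended to select a comparison set of $m$ arms; the class name, the worst-case-UCB ranking heuristic, and all parameters are assumptions for illustration, not the paper's exact algorithm.

```python
# Hedged sketch of an RUCB-style multi-dueling selection rule.
# The set-selection heuristic is illustrative, not the paper's MultiRUCB.
import numpy as np

class MultiRUCBSketch:
    def __init__(self, n_arms, subset_size, alpha=0.51):
        self.m = subset_size                     # capacity of the comparison set
        self.alpha = alpha                       # exploration parameter (assumed)
        self.wins = np.zeros((n_arms, n_arms))   # wins[i, j]: times i beat j

    def select(self, t):
        n = self.wins + self.wins.T              # pairwise comparison counts
        p_hat = np.where(n > 0, self.wins / np.maximum(n, 1), 0.5)
        bonus = np.sqrt(self.alpha * np.log(max(t, 2)) / np.maximum(n, 1))
        ucb = np.minimum(p_hat + bonus, 1.0)     # optimistic preference estimates
        np.fill_diagonal(ucb, 0.5)
        # An arm looks plausible if its UCB against every rival is >= 1/2;
        # rank arms by their worst-case UCB and keep the top m.
        scores = ucb.min(axis=1)
        return np.argsort(-scores)[: self.m]

    def update(self, duels):
        # duels: iterable of (winner, loser) pairs observed this round
        for w, l in duels:
            self.wins[w, l] += 1
```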
Related papers
- Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $\tilde{\mathcal{O}}(d\sqrt{T})$, where $d$ is the model dimension and $T$ is the time horizon.
arXiv Detail & Related papers (2024-04-09T04:45:18Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
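As a rough illustration of the explore-then-commit recipe with a stochastic-greedy commit phase, here is a minimal sketch; the function name, interfaces, and exploration schedule are assumptions, not the paper's optimized SGB algorithm.

```python
# Hedged sketch: uniform exploration, then a stochastic-greedy commit
# that scans only a random sample of candidates at each greedy step.
import random

def stochastic_greedy_etc(arms, k, pull, explore_rounds, sample_size):
    """arms: base-arm ids; k: set size to commit to; pull(arm) -> noisy
    reward (assumed interface); explore_rounds: pulls per arm."""
    # Phase 1: uniform exploration to estimate each arm's mean reward.
    means = {a: sum(pull(a) for _ in range(explore_rounds)) / explore_rounds
             for a in arms}
    # Phase 2: stochastic-greedy commit over the remaining horizon.
    chosen, remaining = [], set(arms)
    for _ in range(k):
        sample = random.sample(sorted(remaining),
                               min(sample_size, len(remaining)))
        best = max(sample, key=means.get)       # greedy pick within the sample
        chosen.append(best)
        remaining.remove(best)
    return chosen
```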
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Batched Dueling Bandits [13.69077222007053]
We study the batched $K$-armed dueling bandit problem under two standard settings.
We obtain algorithms with a smooth trade-off between the number of batches and regret.
arXiv Detail & Related papers (2022-02-22T04:02:36Z) - Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences [28.79598714109439]
We study the problem of $K$-armed dueling bandits for both stochastic and adversarial environments.
We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits.
Our algorithm is also the first to achieve an optimal $O(\sum_{i=1}^{K} \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark.
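The blurb does not spell out the reduction. One well-known way to drive a dueling bandit with multi-armed bandit learners is the "sparring" construction sketched below, where two UCB1 learners each choose one side of the duel and treat win/loss as reward; this is a generic illustration and may differ from the paper's own reduction.

```python
# Hedged sketch of a "sparring"-style dueling-to-MAB reduction; a
# generic illustration, not necessarily the paper's construction.
import math

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select(self, t):
        for a, c in enumerate(self.counts):
            if c == 0:
                return a                         # play each arm once first
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a]
                   + math.sqrt(2 * math.log(t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def sparring(n_arms, duel, horizon):
    """duel(i, j) -> 1 if arm i beats arm j, else 0 (assumed interface)."""
    left, right = UCB1(n_arms), UCB1(n_arms)
    for t in range(1, horizon + 1):
        i, j = left.select(t), right.select(t)
        win = duel(i, j)
        left.update(i, win)                      # left is rewarded for winning
        right.update(j, 1 - win)                 # right for the mirror outcome
    return left, right
```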
arXiv Detail & Related papers (2022-02-14T13:37:23Z) - Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback [51.21673420940346]
Combinatorial bandits generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set.
We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set.
Based on a projection-free online learning algorithm for finite polytopes, it is the first computationally efficient algorithm which is asymptotically optimal and has competitive empirical performance.
arXiv Detail & Related papers (2021-01-21T10:35:09Z) - Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem.
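The description above is concrete enough to sketch: a minimal meta-UCB loop that treats $N$ base bandit algorithms as arms of a higher-level UCB, feeding the observed reward only to the algorithm that acted. The `act`/`feed` interfaces and the update coupling are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of a "meta-UCB" combiner over N base bandit algorithms.
import math

def meta_ucb(base_algs, env_step, horizon):
    """base_algs: objects with .act() -> arm and .feed(arm, reward);
    env_step(arm) -> reward. Both interfaces are assumed."""
    N = len(base_algs)
    counts, values = [0] * N, [0.0] * N
    for t in range(1, horizon + 1):
        if t <= N:
            k = t - 1                            # try each base algorithm once
        else:
            # Meta-level UCB over the base algorithms' mean rewards.
            k = max(range(N), key=lambda i: values[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        arm = base_algs[k].act()                 # chosen algorithm acts
        r = env_step(arm)
        base_algs[k].feed(arm, r)                # only the chosen one learns
        counts[k] += 1
        values[k] += (r - values[k]) / counts[k]
    return values
```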
arXiv Detail & Related papers (2020-12-24T05:36:29Z) - Bandit algorithms to emulate human decision making using probabilistic distortions [20.422725678982726]
We formulate two stochastic multi-armed bandit problems with distorted probabilities on the reward distributions.
We consider the aforementioned problems in the regret minimization as well as best arm identification framework for multi-armed bandits.
arXiv Detail & Related papers (2016-11-30T17:37:51Z)