Batched Dueling Bandits
- URL: http://arxiv.org/abs/2202.10660v1
- Date: Tue, 22 Feb 2022 04:02:36 GMT
- Title: Batched Dueling Bandits
- Authors: Arpit Agarwal, Rohan Ghuge, Viswanath Nagarajan
- Abstract summary: We study the batched $K$-armed dueling bandit problem under two standard settings.
We obtain algorithms with a smooth trade-off between the number of batches and regret.
- Score: 13.69077222007053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The $K$-armed dueling bandit problem, where the feedback is in the form of
noisy pairwise comparisons, has been widely studied. Previous works have only
focused on the sequential setting where the policy adapts after every
comparison. However, in many applications such as search ranking and
recommendation systems, it is preferable to perform comparisons in a limited
number of parallel batches. We study the batched $K$-armed dueling bandit
problem under two standard settings: (i) existence of a Condorcet winner, and
(ii) strong stochastic transitivity and stochastic triangle inequality. For
both settings, we obtain algorithms with a smooth trade-off between the number
of batches and regret. Our regret bounds match the best known sequential regret
bounds (up to poly-logarithmic factors), using only a logarithmic number of
batches. We complement our regret analysis with a nearly-matching lower bound.
Finally, we also validate our theoretical results via experiments on synthetic
and real data.
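To make the interaction model concrete, the sketch below shows the batched dueling-bandit protocol: the learner commits to a whole batch of pairwise comparisons, the noisy outcomes are revealed together at the end of the batch, and only then may the policy adapt. The preference matrix and the uniform pair-selection rule are illustrative assumptions, not the paper's algorithm.

```python
# A minimal sketch of the batched interaction protocol, not the paper's
# algorithm: the preference matrix and the uniform pair-selection rule
# below are illustrative assumptions.
import random

K = 4
# P[i][j] = probability that arm i beats arm j in a duel; arm 0 is a
# Condorcet winner here because P[0][j] > 1/2 for every j != 0.
P = [
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
]
wins = [[0] * K for _ in range(K)]  # pairwise win counts

def duel(i: int, j: int) -> int:
    """One noisy pairwise comparison; returns the winning arm."""
    return i if random.random() < P[i][j] else j

def choose_batch(size: int) -> list[tuple[int, int]]:
    """Commit to a whole batch of comparisons in advance. Here the pairs
    are uniform; a real policy would use the win counts so far."""
    pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
    return [random.choice(pairs) for _ in range(size)]

for _ in range(5):                      # a small number of parallel batches
    schedule = choose_batch(size=200)   # the policy is fixed within a batch
    for i, j in schedule:
        w = duel(i, j)
        wins[w][i if w == j else j] += 1  # counts consulted only between batches

print("most pairwise wins: arm", max(range(K), key=lambda a: sum(wins[a])))
```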
Related papers
- Neural Dueling Bandits [58.90189511247936] (2024-07-24)
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
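As a toy illustration of estimating a reward function from preference feedback, the sketch below fits a Bradley-Terry logistic model by SGD. This is a hedged stand-in, not the paper's method: the linear model substitutes for the neural network, and every name and constant is an assumption.

```python
# A hedged illustration (not the paper's method): fitting a utility model to
# (winner, loser) preference pairs with a Bradley-Terry logistic loss. The
# linear model stands in for a neural network; all names are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta = rng.normal(size=d)               # parameters of the stand-in model

def utility(x: np.ndarray) -> float:
    return float(x @ theta)              # stand-in for a neural forward pass

def bt_sgd_step(x_win: np.ndarray, x_lose: np.ndarray, lr: float = 0.1) -> None:
    """One SGD step on the Bradley-Terry loss -log sigmoid(u(win) - u(lose))."""
    global theta
    p = 1.0 / (1.0 + np.exp(-(utility(x_win) - utility(x_lose))))
    theta = theta + lr * (1.0 - p) * (x_win - x_lose)  # descend the neg. log-lik.

# a synthetic stream of preference feedback on random feature vectors
true_w = rng.normal(size=d)
for _ in range(2000):
    x1, x2 = rng.normal(size=d), rng.normal(size=d)
    winner, loser = (x1, x2) if x1 @ true_w > x2 @ true_w else (x2, x1)
    bt_sgd_step(winner, loser)

print("cosine(theta, true_w) =",
      theta @ true_w / (np.linalg.norm(theta) * np.linalg.norm(true_w)))
```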
- Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026] (2024-04-09)
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $\tilde{\mathcal{O}}(d\sqrt{T})$, where $d$ is the model dimension and $T$ is the time horizon.
- Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513] (2023-11-22)
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches.
Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
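For intuition on why a logarithmic number of batches can suffice, here is one standard construction (an assumption here, not necessarily the grid used in the paper): a doubling schedule of batch endpoints, which yields about $\log_2 T$ batches.

```python
# One standard construction (an assumption, not necessarily the paper's
# grid): doubling batch endpoints, which yields about log2(T) batches.
import math

def doubling_grid(T: int) -> list[int]:
    """Batch endpoints 2, 4, 8, ..., capped at T."""
    grid, t = [], 2
    while t < T:
        grid.append(t)
        t *= 2
    grid.append(T)
    return grid

T = 1_000_000
print(len(doubling_grid(T)), "batches for T =", T,
      "~ log2(T) =", math.ceil(math.log2(T)))   # 20 batches
```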
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505] (2023-10-02)
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound of $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde{O}(d)$ regret.
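Spelling out that reduction: when every comparison is deterministic, each per-round variance $\sigma_t^2$ is zero, and the bound collapses.

```latex
% Substituting a deterministic comparison model (\sigma_t^2 = 0 for all t)
% into the variance-aware bound:
\tilde{O}\Big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\Big)
  \;=\; \tilde{O}\big(d\sqrt{0} + d\big)
  \;=\; \tilde{O}(d).
```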
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342] (2023-10-01)
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
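As a rough illustration (not the paper's algorithm), a top-two-style Thompson sampling loop looks like this: sample from each arm's posterior, pull the posterior leader with some probability, and otherwise pull a challenger. Here the challenger is simply the best non-leader in the same posterior sample, a simplified stand-in for the paper's best-challenger rule, and no stopping rule is implemented.

```python
# A rough top-two-style sketch, NOT the paper's algorithm: Beta posteriors
# for Bernoulli arms; pull the posterior leader with probability beta, else
# a challenger. The challenger here is the best non-leader in the same
# posterior sample, a stand-in for the paper's best-challenger rule.
import random

means = [0.3, 0.5, 0.7]          # unknown Bernoulli arm means
K, beta = len(means), 0.5
alpha_ = [1] * K                 # Beta(1, 1) posterior parameters
beta_ = [1] * K

for _ in range(5000):
    samples = [random.betavariate(alpha_[a], beta_[a]) for a in range(K)]
    leader = max(range(K), key=lambda a: samples[a])
    if random.random() < beta:
        arm = leader
    else:                        # challenger: best non-leader in this sample
        arm = max((a for a in range(K) if a != leader), key=lambda a: samples[a])
    r = 1 if random.random() < means[arm] else 0
    alpha_[arm] += r
    beta_[arm] += 1 - r

post_mean = [alpha_[a] / (alpha_[a] + beta_[a]) for a in range(K)]
print("recommended arm:", max(range(K), key=lambda a: post_mean[a]))
```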
- Dueling Bandits: From Two-dueling to Multi-dueling [40.4378338001229] (2022-11-16)
We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret due to selecting suboptimal arms.
This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options.
- An Asymptotically Optimal Batched Algorithm for the Dueling Bandit Problem [13.69077222007053] (2022-09-25)
We study the $K$-armed dueling bandit problem, a variation of the traditional multi-armed bandit problem in which feedback is obtained in the form of pairwise comparisons.
We obtain a regret of $O(K^2\log(K)) + O(K\log(T))$ in $O(\log(T))$ rounds, where $T$ is the time horizon.
In computational experiments over a variety of real-world datasets, we observe that our algorithm, using $O(\log(T))$ rounds, achieves almost the same performance as fully sequential algorithms.
- The Impact of Batch Learning in Stochastic Linear Bandits [7.3449418475577595] (2022-02-14)
We consider a special case of bandit problems, named batched bandits, in which an agent observes batches of responses over a certain time period.
Our main theoretical results show that the impact of batch learning can be measured in proportion to the regret of the corresponding online behavior.
- Batched Neural Bandits [107.5072688105936] (2021-02-25)
BatchNeuralUCB combines neural networks with optimism to address the exploration-exploitation tradeoff.
We prove that BatchNeuralUCB achieves the same regret as the fully sequential version while reducing the number of policy updates considerably.
- Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741] (2020-02-20)
We consider the $K$-armed dueling bandit problem in the contextual setting.
We present two algorithms for this setup, each with regret guarantees.