Adversarial Dueling Bandits
- URL: http://arxiv.org/abs/2010.14563v1
- Date: Tue, 27 Oct 2020 19:09:08 GMT
- Title: Adversarial Dueling Bandits
- Authors: Aadirupa Saha, Tomer Koren, Yishay Mansour
- Abstract summary: We introduce the problem of regret minimization in Adversarial Dueling Bandits.
The learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair.
Our main result is an algorithm whose $T$-round regret compared to the \emph{Borda-winner} from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$.
- Score: 85.14061196945599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the problem of regret minimization in Adversarial Dueling
Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a
pair of items and observe only a relative binary `win-loss' feedback for this
pair, but here this feedback is generated from an arbitrary preference matrix,
possibly chosen adversarially. Our main result is an algorithm whose $T$-round
regret compared to the \emph{Borda-winner} from a set of $K$ items is
$\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$
lower bound. We also prove a similar high probability regret bound. We further
consider a simpler \emph{fixed-gap} adversarial setup, which bridges between
two extreme preference feedback models for dueling bandits: stationary
preferences and an arbitrary sequence of preferences. For the fixed-gap
adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T}) }$ regret
algorithm, where $\Delta$ is the gap in Borda scores between the best item and
all other items, and show a lower bound of $\Omega(K/\Delta^2)$ indicating that
our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to
logarithmic factors).
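As a concrete illustration of the quantities in the abstract, the sketch below computes Borda scores, the Borda winner, and the gap $\Delta$ from a preference matrix. This is not code from the paper; the $4 \times 4$ matrix is made up for illustration, and the Borda score of item $i$ is taken as its average win probability against a uniformly random opponent $j \neq i$.

```python
import numpy as np

# Hypothetical preference matrix: P[i, j] = probability that item i beats item j.
# Any matrix with P[i, j] + P[j, i] = 1 (so P[i, i] = 0.5) would do here.
P = np.array([
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
])
K = P.shape[0]

# Borda score of item i: average probability that i beats a uniform random j != i.
# Subtract the diagonal entry P[i, i] = 0.5 before averaging over the K-1 opponents.
borda = (P.sum(axis=1) - 0.5) / (K - 1)

winner = int(np.argmax(borda))  # the Borda winner

# Gap Delta: Borda score of the best item minus that of the best of the rest.
sorted_scores = np.sort(borda)
delta = sorted_scores[-1] - sorted_scores[-2]

print(winner, borda, delta)
```

With a stationary matrix like this one, $\Delta$ is fixed; in the fixed-gap adversarial setup of the paper the matrix may change every round while this gap stays bounded below by $\Delta$.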
Related papers
- Adversarial Multi-dueling Bandits [0.4467766778351321]
We introduce the problem of regret minimization in adversarial multi-dueling bandits.
We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback.
arXiv Detail & Related papers (2024-06-18T10:28:12Z)
- Adversarial Combinatorial Bandits with Switching Costs [55.2480439325792]
We study the adversarial bandit problem with a switching cost $\lambda$ incurred for each switch of the selected arm in each round.
We derive lower bounds for the minimax regret and design algorithms to approach them.
arXiv Detail & Related papers (2024-04-02T12:15:37Z)
- Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where the loss value of an action is revealed only after a delay.
We develop a novel algorithm and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z)
- Borda Regret Minimization for Generalized Linear Dueling Bandits [65.09919504862496]
We study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score.
We propose a rich class of generalized linear dueling bandit models, which cover many existing models.
Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal.
arXiv Detail & Related papers (2023-03-15T17:59:27Z)
- Better Best of Both Worlds Bounds for Bandits with Switching Costs [37.71741656687868]
We study best-of-both-worlds algorithms for bandits with switching costs, recently addressed by Rouyer, Seldin and Cesa-Bianchi, 2021.
We introduce a surprisingly simple and effective algorithm that simultaneously achieves the minimax optimal regret bound of $\mathcal{O}(T^{2/3})$ in the oblivious adversarial setting.
arXiv Detail & Related papers (2022-06-07T08:22:56Z)
- Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback [10.179145161000315]
We study the adversarial bandit problem with composite anonymous delayed feedback.
In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen.
We show that the non-oblivious setting incurs $\Omega(T)$ pseudo-regret even when the loss sequence has bounded memory.
arXiv Detail & Related papers (2022-04-27T08:32:35Z)
- Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.