Non-Stationary Dueling Bandits
- URL: http://arxiv.org/abs/2202.00935v1
- Date: Wed, 2 Feb 2022 09:57:35 GMT
- Title: Non-Stationary Dueling Bandits
- Authors: Patrick Kolpaczki, Viktor Bengs, Eyke Hüllermeier
- Abstract summary: We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments.
To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible.
We show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$.
- Score: 8.298716599039501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the non-stationary dueling bandits problem with $K$ arms, where the
time horizon $T$ consists of $M$ stationary segments, each of which is
associated with its own preference matrix. The learner repeatedly selects a
pair of arms and observes a binary preference between them as feedback. To
minimize the accumulated regret, the learner needs to pick the Condorcet winner
of each stationary segment as often as possible, despite preference matrices
and segment lengths being unknown. We propose the $\mathrm{Beat\, the\,
Winner\, Reset}$ algorithm and prove a bound on its expected binary weak regret
in the stationary case, which tightens the bound of current state-of-the-art
algorithms. We also show a regret bound for the non-stationary case, without
requiring knowledge of $M$ or $T$. We further propose and analyze two
meta-algorithms, $\mathrm{DETECT}$ for weak regret and $\mathrm{Monitored\,
Dueling\, Bandits}$ for strong regret, both based on a detection-window
approach that can incorporate any dueling bandit algorithm as a black-box
algorithm. Finally, we prove a worst-case lower bound for expected weak regret
in the non-stationary case.
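As a concrete illustration of the setting, the following minimal Python sketch simulates the interaction protocol described in the abstract: the horizon splits into $M$ stationary segments, each with its own preference matrix, and a round contributes binary weak regret 1 unless the segment's Condorcet winner is among the queried pair. The segment construction, the uniform-random learner, and all names are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, seg_len = 5, 3, 1000  # arms, stationary segments, rounds per segment (illustrative)

def segment_preferences(K, rng, gap=0.1):
    """Preference matrix induced by a random total order, so a Condorcet winner exists."""
    ranking = rng.permutation(K)            # ranking[0] is the Condorcet winner
    rank = np.empty(K, dtype=int)
    rank[ranking] = np.arange(K)
    P = np.full((K, K), 0.5)
    for i in range(K):
        for j in range(K):
            if rank[i] < rank[j]:           # i is ranked above j
                P[i, j], P[j, i] = 0.5 + gap, 0.5 - gap
    return P, int(ranking[0])

weak_regret = 0
for _ in range(M):                          # each segment has its own preference matrix
    P, winner = segment_preferences(K, rng)
    for _ in range(seg_len):
        i, j = rng.choice(K, size=2, replace=False)  # placeholder learner: uniform pairs
        feedback = rng.random() < P[i, j]   # binary preference: 1 iff arm i beats arm j
        weak_regret += int(winner not in (i, j))     # binary weak regret accounting
print(f"binary weak regret of uniform play over T = {M * seg_len} rounds: {weak_regret}")
```

A real learner would of course use `feedback` to update its pair-selection rule; the uniform baseline only serves to show how segments, preference matrices, and weak regret fit together.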
Related papers
- Stochastic Bandits Robust to Adversarial Attacks [33.278131584647745]
This paper investigates multi-armed bandit algorithms that are robust to adversarial attacks.
We study two cases of this model, with or without the knowledge of an attack budget $C$.
We devise two types of algorithms with regret bounds having additive or multiplicative $C$ dependence terms.
arXiv Detail & Related papers (2024-08-16T17:41:35Z)
- Adversarial Multi-dueling Bandits [0.4467766778351321]
We introduce the problem of regret minimization in adversarial multi-dueling bandits.
We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback.
arXiv Detail & Related papers (2024-06-18T10:28:12Z)
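Since MiDEX (Multi Dueling EXP3) above is, by its name, built on the EXP3 template, here is a minimal sketch of classical EXP3 for adversarial $K$-armed bandits. This is the generic template only, not the MiDEX algorithm, and all parameter choices are illustrative.

```python
import numpy as np

def exp3(K, T, reward_fn, gamma, rng):
    """Classical EXP3 (Auer et al.): exponential weights with importance weighting."""
    weights = np.ones(K)
    total = 0.0
    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / K  # mix in uniform exploration
        arm = rng.choice(K, p=probs)
        r = reward_fn(t, arm)                                      # reward in [0, 1]
        total += r
        weights[arm] *= np.exp(gamma * (r / probs[arm]) / K)       # importance-weighted update
    return total

rng = np.random.default_rng(1)
means = np.array([0.4, 0.4, 0.7, 0.4])  # stochastic smoke test; EXP3 also handles adversaries
gain = exp3(K=4, T=5000, reward_fn=lambda t, a: float(rng.random() < means[a]),
            gamma=0.05, rng=rng)
print(f"total reward: {gain:.0f} (best arm would average {5000 * means.max():.0f})")
```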
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
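To make the GLM comparison model above concrete, the sketch below samples preference feedback with a logistic link, one common GLM instantiation (our assumption; the paper treats generalized linear models more broadly), with hypothetical feature vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
theta = rng.normal(size=d)            # unknown parameter the learner would estimate

def duel(phi_i, phi_j):
    """Binary comparison from a logistic-link GLM: P(i beats j) = mu(<theta, phi_i - phi_j>)."""
    p = 1.0 / (1.0 + np.exp(-(phi_i - phi_j) @ theta))
    return int(rng.random() < p)      # 1 iff arm i wins the duel

# Example: duel two arms with random context features.
phi_i, phi_j = rng.normal(size=d), rng.normal(size=d)
print(duel(phi_i, phi_j))
```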
- Borda Regret Minimization for Generalized Linear Dueling Bandits [65.09919504862496]
We study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score.
We propose a rich class of generalized linear dueling bandit models, which cover many existing models.
Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal.
arXiv Detail & Related papers (2023-03-15T17:59:27Z)
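The Borda score referenced in the entry above is an item's average probability of beating a uniformly random other item, $B(i) = \frac{1}{K-1}\sum_{j \neq i} P_{ij}$. A small sketch with an illustrative preference matrix:

```python
import numpy as np

# Illustrative preference matrix (P[i, j] = probability item i beats item j).
# Note it is cyclic (0 beats 1, 1 beats 2, 2 beats 0), so no Condorcet winner
# exists, yet the Borda winner is still well defined.
P = np.array([[0.5, 0.6, 0.3],
              [0.4, 0.5, 0.8],
              [0.7, 0.2, 0.5]])
K = P.shape[0]
borda = (P.sum(axis=1) - 0.5) / (K - 1)  # drop the diagonal entry P[i, i] = 0.5
print(borda, int(np.argmax(borda)))      # [0.45 0.6 0.45] -> Borda winner is item 1
```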
- Variance-Aware Sparse Linear Bandits [64.70681598741417]
The worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z)
- Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms [30.024167992890916]
We study a sequential decision problem where a learner faces a sequence of $K$-armed bandit tasks.
The adversary is constrained to choose the optimal arm of each task from a smaller (but unknown) subset of $M$ arms.
The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting).
arXiv Detail & Related papers (2022-02-25T22:28:01Z)
- Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$, as long as certain conditions are satisfied.
arXiv Detail & Related papers (2022-02-12T21:55:44Z)
- A Regret Bound for Non-stationary Multi-Armed Bandits with Fairness Constraints [7.716156977428555]
We present a new algorithm, called Fair Upper Confidence Bound with Exploration (Fair-UCBe), for solving a slowly varying $k$-armed bandit problem.
We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the non-stationarity tends to zero.
arXiv Detail & Related papers (2020-12-24T18:12:01Z)
- Adversarial Dueling Bandits [85.14061196945599]
We introduce the problem of regret minimization in Adversarial Dueling Bandits.
The learner has to repeatedly choose a pair of items and observe only relative binary win-loss feedback for this pair.
Our main result is an algorithm with a guarantee on its $T$-round regret relative to the Borda-winner of a set of $K$ items.
arXiv Detail & Related papers (2020-10-27T19:09:08Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
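The OPLB entry above builds on the upper-confidence bound principle for linear bandits. As background, here is a generic LinUCB-style selection step; this is a textbook template under illustrative parameters, not OPLB itself, whose pessimistic handling of the constraint is not reproduced here.

```python
import numpy as np

# Generic LinUCB-style step for linear bandits: maintain a regularized
# least-squares estimate and pick the action with the largest optimistic index.
d, lam, beta = 4, 1.0, 1.0            # dimension, ridge parameter, bonus scale (illustrative)
V = lam * np.eye(d)                   # regularized design matrix
b = np.zeros(d)                       # sum of reward-weighted action features

def choose(actions):
    """actions: list of d-dimensional feature vectors; returns index of the optimistic pick."""
    theta_hat = np.linalg.solve(V, b) # ridge estimate of the unknown parameter
    V_inv = np.linalg.inv(V)
    scores = [a @ theta_hat + beta * np.sqrt(a @ V_inv @ a) for a in actions]
    return int(np.argmax(scores))

def update(a, reward):
    """Rank-one update of the statistics after observing the chosen action's reward."""
    global V, b
    V += np.outer(a, a)
    b += reward * a
```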
- Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the $K$-armed dueling bandit problem in the contextual setting.
We present two algorithms for this setup with corresponding regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)