Non-Stationary Dueling Bandits
- URL: http://arxiv.org/abs/2202.00935v1
- Date: Wed, 2 Feb 2022 09:57:35 GMT
- Title: Non-Stationary Dueling Bandits
- Authors: Patrick Kolpaczki, Viktor Bengs, Eyke Hüllermeier
- Abstract summary: We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments.
To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible.
We show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$.
- Score: 8.298716599039501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the non-stationary dueling bandits problem with $K$ arms, where the
time horizon $T$ consists of $M$ stationary segments, each of which is
associated with its own preference matrix. The learner repeatedly selects a
pair of arms and observes a binary preference between them as feedback. To
minimize the accumulated regret, the learner needs to pick the Condorcet winner
of each stationary segment as often as possible, despite preference matrices
and segment lengths being unknown. We propose the $\mathrm{Beat\, the\,
Winner\, Reset}$ algorithm and prove a bound on its expected binary weak regret
in the stationary case, which tightens the bound of current state-of-the-art
algorithms. We also show a regret bound for the non-stationary case, without
requiring knowledge of $M$ or $T$. We further propose and analyze two
meta-algorithms, $\mathrm{DETECT}$ for weak regret and $\mathrm{Monitored\,
Dueling\, Bandits}$ for strong regret, both based on a detection-window
approach that can incorporate any dueling bandit algorithm as a black-box
algorithm. Finally, we prove a worst-case lower bound for expected weak regret
in the non-stationary case.
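As a concrete illustration of the setting, the following minimal Python sketch simulates the interaction protocol described in the abstract: the horizon splits into $M$ stationary segments, each with its own preference matrix, and a round contributes binary weak regret 1 unless the segment's Condorcet winner is among the queried pair. The segment construction, the uniform-random learner, and all names are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, seg_len = 5, 3, 1000  # arms, stationary segments, rounds per segment (illustrative)

def segment_preferences(K, rng, gap=0.1):
    """Preference matrix induced by a random total order, so a Condorcet winner exists."""
    ranking = rng.permutation(K)            # ranking[0] is the Condorcet winner
    rank = np.empty(K, dtype=int)
    rank[ranking] = np.arange(K)
    P = np.full((K, K), 0.5)
    for i in range(K):
        for j in range(K):
            if rank[i] < rank[j]:           # i is ranked above j
                P[i, j], P[j, i] = 0.5 + gap, 0.5 - gap
    return P, int(ranking[0])

weak_regret = 0
for _ in range(M):                          # each segment has its own preference matrix
    P, winner = segment_preferences(K, rng)
    for _ in range(seg_len):
        i, j = rng.choice(K, size=2, replace=False)  # placeholder learner: uniform pairs
        feedback = rng.random() < P[i, j]   # binary preference: 1 iff arm i beats arm j
        weak_regret += int(winner not in (i, j))     # binary weak regret accounting
print(f"binary weak regret of uniform play over T = {M * seg_len} rounds: {weak_regret}")
```

A real learner would of course use `feedback` to update its pair-selection rule; the uniform baseline only serves to show how segments, preference matrices, and weak regret fit together.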
Related papers
- Stochastic Bandits Robust to Adversarial Attacks [33.278131584647745]
This paper investigates multi-armed bandit algorithms that are robust to adversarial attacks.
We study two cases of this model, with or without the knowledge of an attack budget $C$.
We devise two types of algorithms with regret bounds having additive or multiplicative $C$ dependence terms.
arXiv Detail & Related papers (2024-08-16T17:41:35Z)
- Adversarial Multi-dueling Bandits [0.4467766778351321]
We introduce the problem of regret minimization in adversarial multi-dueling bandits.
We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback.
arXiv Detail & Related papers (2024-06-18T10:28:12Z)
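Since MiDEX (Multi Dueling EXP3) above is, by its name, built on the EXP3 template, here is a minimal sketch of classical EXP3 for adversarial $K$-armed bandits. This is the generic template only, not the MiDEX algorithm, and all parameter choices are illustrative.

```python
import numpy as np

def exp3(K, T, reward_fn, gamma, rng):
    """Classical EXP3 (Auer et al.): exponential weights with importance weighting."""
    weights = np.ones(K)
    total = 0.0
    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / K  # mix in uniform exploration
        arm = rng.choice(K, p=probs)
        r = reward_fn(t, arm)                                      # reward in [0, 1]
        total += r
        weights[arm] *= np.exp(gamma * (r / probs[arm]) / K)       # importance-weighted update
    return total

rng = np.random.default_rng(1)
means = np.array([0.4, 0.4, 0.7, 0.4])  # stochastic smoke test; EXP3 also handles adversaries
gain = exp3(K=4, T=5000, reward_fn=lambda t, a: float(rng.random() < means[a]),
            gamma=0.05, rng=rng)
print(f"total reward: {gain:.0f} (best arm would average {5000 * means.max():.0f})")
```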
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
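To make the GLM comparison model above concrete, the sketch below samples preference feedback with a logistic link, one common GLM instantiation (our assumption; the paper treats generalized linear models more broadly), with hypothetical feature vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
theta = rng.normal(size=d)            # unknown parameter the learner would estimate

def duel(phi_i, phi_j):
    """Binary comparison from a logistic-link GLM: P(i beats j) = mu(<theta, phi_i - phi_j>)."""
    p = 1.0 / (1.0 + np.exp(-(phi_i - phi_j) @ theta))
    return int(rng.random() < p)      # 1 iff arm i wins the duel

# Example: duel two arms with random context features.
phi_i, phi_j = rng.normal(size=d), rng.normal(size=d)
print(duel(phi_i, phi_j))
```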
- Borda Regret Minimization for Generalized Linear Dueling Bandits [65.09919504862496]
We study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score.
We propose a rich class of generalized linear dueling bandit models, which cover many existing models.
Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal.
arXiv Detail & Related papers (2023-03-15T17:59:27Z)
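The Borda score referenced in the entry above is an item's average probability of beating a uniformly random other item, $B(i) = \frac{1}{K-1}\sum_{j \neq i} P_{ij}$. A small sketch with an illustrative preference matrix:

```python
import numpy as np

# Illustrative preference matrix (P[i, j] = probability item i beats item j).
# Note it is cyclic (0 beats 1, 1 beats 2, 2 beats 0), so no Condorcet winner
# exists, yet the Borda winner is still well defined.
P = np.array([[0.5, 0.6, 0.3],
              [0.4, 0.5, 0.8],
              [0.7, 0.2, 0.5]])
K = P.shape[0]
borda = (P.sum(axis=1) - 0.5) / (K - 1)  # drop the diagonal entry P[i, i] = 0.5
print(borda, int(np.argmax(borda)))      # [0.45 0.6 0.45] -> Borda winner is item 1
```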
- Variance-Aware Sparse Linear Bandits [64.70681598741417]
The worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z)
- Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms [30.024167992890916]
We study a sequential decision problem where a learner faces a sequence of $K$-armed bandit tasks.
The adversary is constrained to choose the optimal arm of each task from a smaller (but unknown) subset of $M$ arms.
The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting).
arXiv Detail & Related papers (2022-02-25T22:28:01Z)
- Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$, as long as certain conditions are satisfied.
arXiv Detail & Related papers (2022-02-12T21:55:44Z)
- A Regret Bound for Non-stationary Multi-Armed Bandits with Fairness Constraints [7.716156977428555]
We present a new algorithm, called Fair Upper Confidence Bound with Exploration (Fair-UCBe), for solving a slowly varying $k$-armed bandit problem.
We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the non-stationarity tends to zero.
arXiv Detail & Related papers (2020-12-24T18:12:01Z)
- Adversarial Dueling Bandits [85.14061196945599]
We introduce the problem of regret minimization in Adversarial Dueling Bandits.
The learner has to repeatedly choose a pair of items and observe only relative binary win-loss feedback for this pair.
Our main result is an algorithm with a guarantee on its $T$-round regret relative to the Borda-winner of a set of $K$ items.
arXiv Detail & Related papers (2020-10-27T19:09:08Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
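The OPLB entry above builds on the upper-confidence bound principle for linear bandits. As background, here is a generic LinUCB-style selection step; this is a textbook template under illustrative parameters, not OPLB itself, whose pessimistic handling of the constraint is not reproduced here.

```python
import numpy as np

# Generic LinUCB-style step for linear bandits: maintain a regularized
# least-squares estimate and pick the action with the largest optimistic index.
d, lam, beta = 4, 1.0, 1.0            # dimension, ridge parameter, bonus scale (illustrative)
V = lam * np.eye(d)                   # regularized design matrix
b = np.zeros(d)                       # sum of reward-weighted action features

def choose(actions):
    """actions: list of d-dimensional feature vectors; returns index of the optimistic pick."""
    theta_hat = np.linalg.solve(V, b) # ridge estimate of the unknown parameter
    V_inv = np.linalg.inv(V)
    scores = [a @ theta_hat + beta * np.sqrt(a @ V_inv @ a) for a in actions]
    return int(np.argmax(scores))

def update(a, reward):
    """Rank-one update of the statistics after observing the chosen action's reward."""
    global V, b
    V += np.outer(a, a)
    b += reward * a
```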
- Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the $K$-armed dueling bandit problem in the contextual setting.
We present two algorithms for this setup with corresponding regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)