Dueling Bandits with Adversarial Sleeping
- URL: http://arxiv.org/abs/2107.02274v1
- Date: Mon, 5 Jul 2021 21:14:04 GMT
- Title: Dueling Bandits with Adversarial Sleeping
- Authors: Aadirupa Saha, Pierre Gaillard
- Abstract summary: We introduce the problem of sleeping dueling bandits with preferences and adversarial availabilities.
The goal is to find an optimal `no-regret' policy that can identify the best available item at each round.
- Score: 26.308541799686505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the problem of sleeping dueling bandits with stochastic
preferences and adversarial availabilities (DB-SPAA). In almost all dueling
bandit applications, the decision space often changes over time; e.g., retail
store management, online shopping, restaurant recommendation, search engine
optimization, etc. Surprisingly, this `sleeping aspect' of dueling bandits has
never been studied in the literature. Like dueling bandits, the goal is to
compete with the best arm by sequentially querying the preference feedback of
item pairs. The non-triviality however results due to the non-stationary item
spaces that allow any arbitrary subsets items to go unavailable every round.
The goal is to find an optimal `no-regret' policy that can identify the best
available item at each round, as opposed to the standard `fixed best-arm regret
objective' of dueling bandits. We first derive an instance-specific lower bound
for DB-SPAA $\Omega( \sum_{i =1}^{K-1}\sum_{j=i+1}^K \frac{\log
T}{\Delta(i,j)})$, where $K$ is the number of items and $\Delta(i,j)$ is the
gap between items $i$ and $j$. This indicates that the sleeping problem with
preference feedback is inherently more difficult than that for classical
multi-armed bandits (MAB). We then propose two algorithms with near-optimal
regret guarantees. Our results are corroborated empirically.
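The interaction protocol described in the abstract (adversarial availabilities, pairwise preference feedback, regret against the best *available* item) can be sketched in a few lines of Python. This is an illustrative simulation under assumed names (`pref_matrix`, `best_available`, a naive random pairing policy), not the paper's algorithm; it assumes the best available item is a Condorcet winner among the awake items.

```python
import random

def duel(pref_matrix, i, j, rng):
    """Return the duel winner: i beats j with probability pref_matrix[i][j]."""
    return i if rng.random() < pref_matrix[i][j] else j

def best_available(pref_matrix, available):
    """Best available item under the preference matrix (assumed Condorcet winner)."""
    return max(available,
               key=lambda i: sum(pref_matrix[i][j] for j in available if j != i))

def run(pref_matrix, availabilities, pick_pair, rng):
    """Play one DB-SPAA-style episode; return cumulative dueling regret."""
    regret = 0.0
    for available in availabilities:      # adversarial sleeping: any subset per round
        i, j = pick_pair(available, rng)  # learner's chosen pair
        _ = duel(pref_matrix, i, j, rng)  # preference feedback (unused by this naive policy)
        star = best_available(pref_matrix, available)
        # dueling regret: average preference gap of the chosen pair to the best available item
        regret += 0.5 * ((pref_matrix[star][i] - 0.5) + (pref_matrix[star][j] - 0.5))
    return regret

if __name__ == "__main__":
    rng = random.Random(0)
    # 3 items; item 0 beats every other item with probability > 0.5
    P = [[0.5, 0.7, 0.8],
         [0.3, 0.5, 0.6],
         [0.2, 0.4, 0.5]]
    avail = [rng.sample(range(3), rng.randint(2, 3)) for _ in range(100)]
    naive = lambda s, r: tuple(r.sample(list(s), 2))
    print(run(P, avail, naive, rng))
```

A learning policy would replace `naive` with a rule that updates pairwise win-count estimates from the duel outcomes; the naive uniform policy here only illustrates how regret accumulates against the per-round best available item.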
Related papers
- Adversarial Multi-dueling Bandits [0.4467766778351321]
We introduce the problem of regret minimization in adversarial multi-dueling bandits.
We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback.
arXiv Detail & Related papers (2024-06-18T10:28:12Z) - Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs).
We study a model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose a robust contextual dueling bandit algorithm based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z) - Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $\tilde{\mathcal{O}}(d\sqrt{T})$, where $d$ is the model dimension and $T$ is the time horizon.
arXiv Detail & Related papers (2024-04-09T04:45:18Z) - Adversarial Combinatorial Bandits with Switching Costs [55.2480439325792]
We study the problem of adversarial combinatorial bandits with a switching cost $\lambda$ incurred for each switch of a selected arm in each round.
We derive lower bounds for the minimax regret and design algorithms to approach them.
arXiv Detail & Related papers (2024-04-02T12:15:37Z) - Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM)
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Adversarial Sleeping Bandit Problems with Multiple Plays: Algorithm and
Ranking Application [7.0717496233701365]
The paper presents an efficient algorithm to solve the sleeping bandit with multiple plays problem in the context of an online recommendation system.
The proposed algorithm extends the sleeping bandit algorithm for single-arm selection and is guaranteed to achieve a regret upper bounded by $\mathcal{O}(kN^2\sqrt{T\log T})$.
arXiv Detail & Related papers (2023-07-27T00:11:59Z) - Borda Regret Minimization for Generalized Linear Dueling Bandits [65.09919504862496]
We study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score.
We propose a rich class of generalized linear dueling bandit models, which cover many existing models.
Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal.
arXiv Detail & Related papers (2023-03-15T17:59:27Z) - Dueling Bandits: From Two-dueling to Multi-dueling [40.4378338001229]
We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret due to selecting suboptimal arms.
This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options.
arXiv Detail & Related papers (2022-11-16T22:00:54Z) - Versatile Dueling Bandits: Best-of-both-World Analyses for Online
Learning from Preferences [28.79598714109439]
We study the problem of $K$-armed dueling bandits for both stochastic and adversarial environments.
We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits.
Our algorithm is also the first to achieve an optimal $O(\sum_{i=1}^{K} \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark.
arXiv Detail & Related papers (2022-02-14T13:37:23Z) - Adversarial Dueling Bandits [85.14061196945599]
We introduce the problem of regret minimization in adversarial dueling bandits.
The learner has to repeatedly choose a pair of items and observe only a relative binary `win-loss' feedback for this pair.
Our main result is an algorithm with a sublinear $T$-round regret compared to the Borda-winner from a set of $K$ items.
arXiv Detail & Related papers (2020-10-27T19:09:08Z) - Improved Sleeping Bandits with Stochastic Action Sets and Adversarial
Rewards [59.559028338399855]
We consider the problem of sleeping bandits with stochastic action sets and adversarial rewards.
In this paper, we provide a new computationally efficient algorithm inspired by EXP3 satisfying a regret of order $O(\sqrt{T})$.
arXiv Detail & Related papers (2020-04-14T00:41:26Z) - Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the problem of $K$-armed dueling bandit in the contextual setting.
We present two algorithms for the setup with respective regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)
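Several of the papers above (the EXP3-inspired sleeping-bandit algorithm, for instance) rely on exponential weighting restricted to the currently available arms. The following is a minimal, hypothetical sketch of that idea; the class name, the reward-based importance-weighted update, and the fixed learning rate `eta` are illustrative simplifications, not any listed paper's exact estimator.

```python
import math
import random

class SleepingEXP3:
    """EXP3-style exponential weights, renormalized over the awake arms each round."""

    def __init__(self, n_arms, eta):
        self.weights = [1.0] * n_arms
        self.eta = eta  # learning rate (assumed fixed for simplicity)

    def probs(self, available):
        """Sampling distribution restricted to the available (awake) arms."""
        z = sum(self.weights[a] for a in available)
        return {a: self.weights[a] / z for a in available}

    def select(self, available, rng):
        """Sample an arm from the restricted distribution."""
        p = self.probs(available)
        r, acc = rng.random(), 0.0
        for a, pa in p.items():
            acc += pa
            if r <= acc:
                return a
        return a  # numerical fallback for floating-point round-off

    def update(self, arm, reward, available):
        """Importance-weighted update on the played arm only."""
        p = self.probs(available)
        est = reward / p[arm]  # unbiased estimate of the arm's reward
        self.weights[arm] *= math.exp(self.eta * est)
```

Typical usage: at each round the environment reveals `available`, the learner calls `select`, observes a reward (or converted preference feedback), and calls `update` with the same availability set so the importance weights match the sampling distribution actually used.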
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.