Stochastic Contextual Dueling Bandits under Linear Stochastic
Transitivity Models
- URL: http://arxiv.org/abs/2202.04593v1
- Date: Wed, 9 Feb 2022 17:44:19 GMT
- Title: Stochastic Contextual Dueling Bandits under Linear Stochastic
Transitivity Models
- Authors: Viktor Bengs, Aadirupa Saha, Eyke Hüllermeier
- Abstract summary: We consider the regret minimization task in a dueling bandits problem with context information.
We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process.
Our experiments demonstrate its superiority over state-of-the-art algorithms for special cases of CoLST models.
- Score: 25.336599480692122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the regret minimization task in a dueling bandits problem with
context information. In every round of the sequential decision problem, the
learner makes a context-dependent selection of two choice alternatives (arms)
to be compared with each other and receives feedback in the form of noisy
preference information. We assume that the feedback process is determined by a
linear stochastic transitivity model with contextualized utilities (CoLST), and
the learner's task is to include the best arm (with highest latent
context-dependent utility) in the duel. We propose a computationally efficient
algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the
feedback process using perturbed context-dependent utility estimates of the
underlying CoLST model. If each arm is associated with a $d$-dimensional
feature vector, we show that $\texttt{CoLSTIM}$ achieves a regret of order
$\tilde O(\sqrt{dT})$ after $T$ learning rounds. Additionally, we
establish the optimality of $\texttt{CoLSTIM}$ by showing a lower bound for the
weak regret that refines the existing average regret analysis. Our experiments
demonstrate its superiority over state-of-the-art algorithms for special cases of
CoLST models.
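A minimal sketch of one round of a CoLSTIM-style selection, assuming a linear utility model with logistic (Bradley-Terry-type) comparison feedback. The Gumbel perturbation imitates the feedback noise, as described in the abstract; the optimistic choice of the second arm and the hyperparameter `alpha` are illustrative assumptions, not a verbatim transcription of the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def colstim_duel(features, theta_hat, V_inv, alpha=1.0):
    """Select a duel (first, second) for one round (illustrative sketch).

    features  : (K, d) context-dependent feature vectors of the K arms
    theta_hat : (d,)   current estimate of the latent utility parameter,
                       e.g. from an MLE on past preference feedback
    V_inv     : (d, d) inverse regularized design matrix of past duels
    alpha     : exploration-bonus width (assumed hyperparameter)
    """
    utilities = features @ theta_hat
    # First arm: imitate the stochastic feedback process by perturbing the
    # estimated utilities; Gumbel noise matches a logit/BTL-type model.
    first = int(np.argmax(utilities + rng.gumbel(size=len(features))))
    # Second arm: an optimistic challenger, favoring arms whose comparison
    # against `first` is still statistically uncertain.
    diff = features - features[first]
    bonus = alpha * np.sqrt(np.einsum("kd,dl,kl->k", diff, V_inv, diff))
    return first, int(np.argmax(utilities + bonus))
```

In a full implementation, `theta_hat` and `V_inv` would be refit after each observed duel outcome, e.g. via online logistic regression; the abstract reports a regret of order $\tilde O(\sqrt{dT})$ for the actual algorithm.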
Related papers
- Contextual Linear Optimization with Bandit Feedback [35.692428244561626]
Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients.
We study a class of offline learning algorithms for CLO with bandit feedback.
We show a fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of the optimization estimate (an induced-ERM objective under bandit feedback is sketched after this entry).
arXiv Detail & Related papers (2024-05-26T13:27:27Z)
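The summary names IERM (induced empirical risk minimization) without details; below is a hedged sketch of what an induced-ERM objective can look like with bandit feedback, assuming a linear cost model, an argmin decision rule, and known logging propensities. None of these specifics come from the paper.

```python
import numpy as np

def ierm_objective(theta, X, a_obs, c_obs, propensity):
    """Induced-ERM objective under bandit feedback (illustrative sketch).

    X          : (n, d) logged contexts
    a_obs      : (n,)   decisions taken by the logging policy
    c_obs      : (n,)   observed cost of the logged decision only
    propensity : (n,)   probability the logging policy chose a_obs[i]

    A linear model W (flattened into theta) predicts a cost vector per
    context; the induced decision is its argmin. Only rounds where the
    induced decision matches the logged one reveal a cost, so an
    inverse-propensity weight keeps the cost estimate unbiased.
    """
    n, d = X.shape
    n_actions = len(theta) // d
    W = theta.reshape(n_actions, d)
    induced = np.argmin(X @ W.T, axis=1)   # decision induced per context
    match = (induced == a_obs).astype(float)
    return float(np.mean(match * c_obs / propensity))
```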
- Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs).
We study a model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose an algorithm, robust contextual dueling bandit (algo), which is based on uncertainty-weighted maximum likelihood estimation (a weighted-MLE sketch follows this entry).
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
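A hedged sketch of an uncertainty-weighted maximum likelihood estimator for preference feedback, assuming a Bradley-Terry/logistic comparison model. How the weights `w` are computed is the paper's contribution and is not specified in the summary; here they are simply inputs, with the intuition that high-uncertainty rounds, where an adversary's flipped labels hurt most, should count less.

```python
import numpy as np
from scipy.optimize import minimize

def weighted_preference_mle(Z, y, w, lam=1.0):
    """Uncertainty-weighted MLE for preference feedback (illustrative sketch).

    Z   : (n, d) feature differences of the two arms compared in each round
    y   : (n,)   binary comparison outcomes (1 if the first arm won)
    w   : (n,)   per-round weights supplied by the caller
    lam : ridge regularization strength (illustrative)
    """
    def neg_log_lik(theta):
        logits = Z @ theta
        # Weighted Bernoulli/logistic log-likelihood of the comparisons.
        ll = w * (y * logits - np.logaddexp(0.0, logits))
        return -ll.sum() + 0.5 * lam * theta @ theta

    d = Z.shape[1]
    return minimize(neg_log_lik, np.zeros(d), method="L-BFGS-B").x
```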
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^T \sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret (the variance-weighted estimation idea is sketched after this entry).
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
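Variance-aware bounds of this form are typically driven by weighting each round's regression sample by its inverse comparison variance. The sketch below shows that core estimation step under a linear model, with the SupLinUCB-style layering omitted; the clipping floor is an illustrative safeguard, not from the paper.

```python
import numpy as np

def variance_weighted_ridge(Z, y, sigma2, lam=1.0):
    """Variance-weighted ridge regression (illustrative core idea).

    Rounds with small comparison variance sigma2[t] receive large weight,
    which is what lets regret scale with sqrt(sum_t sigma_t^2) rather
    than sqrt(T).
    """
    w = 1.0 / np.maximum(sigma2, 1e-8)          # clip to avoid division by 0
    V = lam * np.eye(Z.shape[1]) + (Z * w[:, None]).T @ Z
    b = (Z * w[:, None]).T @ y
    return np.linalg.solve(V, b)
```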
- Stochastic Rising Bandits [40.32303434592863]
We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing.
This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds.
We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
arXiv Detail & Related papers (2022-12-07T17:30:45Z)
- The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z)
- Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision-making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z)
- Universal and data-adaptive algorithms for model selection in linear contextual bandits [52.47796554359261]
We consider the simplest non-trivial instance of model-selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem.
We introduce new algorithms that explore in a data-adaptive manner and provide guarantees of the form $\mathcal{O}(d^{\alpha} T^{1-\alpha})$.
Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.
arXiv Detail & Related papers (2021-11-08T18:05:35Z)
- Doubly robust Thompson sampling for linear payoffs [12.375561840897742]
We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling.
The proposed algorithm is designed to allow a novel additive regret decomposition leading to an improved regret bound of order $\tilde{O}(\phi^{-2}\sqrt{T})$ (the doubly robust reward construction is sketched below).
arXiv Detail & Related papers (2021-02-01T23:31:10Z)
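A minimal sketch of the doubly robust reward construction commonly used in DR bandit methods, under the assumption (not stated in the summary) that the standard inverse-propensity correction is used; names are illustrative.

```python
import numpy as np

def dr_pseudo_rewards(X, chosen, reward, theta_imp, pi_chosen):
    """Doubly robust pseudo-rewards for all arms in one round (sketch).

    X         : (K, d) contexts of all K arms in this round
    chosen    : index of the arm actually pulled
    reward    : observed reward of the chosen arm
    theta_imp : imputation estimator used to fill in unobserved rewards
    pi_chosen : probability the sampling rule assigned to the chosen arm
    """
    imputed = X @ theta_imp          # model-based guess for every arm
    pseudo = imputed.copy()
    # Importance-weighted correction on the one arm we actually observed:
    # unbiased if either the imputation model or the propensity is correct.
    pseudo[chosen] += (reward - imputed[chosen]) / pi_chosen
    return pseudo
```

Regressing on pseudo-rewards for all arms, not just the pulled one, is the kind of structural gain that makes context-covariance quantities such as $\phi^2$ appear explicitly in bounds like the rate quoted above.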
- Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimate the parameter using the best subset selection method (a brute-force best-subset sketch follows this entry).
arXiv Detail & Related papers (2020-09-04T04:10:39Z)
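The summary names two devices, doubly growing epochs and best subset selection, without detail. Below is a brute-force best-subset least-squares sketch; it is plausible only because the summarized setting has a small action space and a sparse parameter, and the epoch schedule around it is omitted.

```python
import numpy as np
from itertools import combinations

def best_subset_ols(X, y, s):
    """Exhaustive best-subset least squares with support size s (sketch).

    X : (n, d) observed contexts; y : (n,) observed rewards.
    Brute force over all size-s supports; only viable for small s and d.
    """
    best_err, best_support, best_coef = np.inf, None, None
    for support in combinations(range(X.shape[1]), s):
        cols = list(support)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        err = float(np.sum((y - X[:, cols] @ coef) ** 2))
        if err < best_err:
            best_err, best_support, best_coef = err, cols, coef
    return best_support, best_coef
```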
- Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the problem of $K$-armed dueling bandit in the contextual setting.
We present two algorithms for the setup with respective regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.