Online Algorithm for Unsupervised Sequential Selection with Contextual
Information
- URL: http://arxiv.org/abs/2010.12353v1
- Date: Fri, 23 Oct 2020 12:32:21 GMT
- Title: Online Algorithm for Unsupervised Sequential Selection with Contextual
Information
- Authors: Arun Verma, Manjesh K. Hanawal, Csaba Szepesvári, Venkatesh
Saligrama
- Abstract summary: We study Contextual Unsupervised Sequential Selection (USS).
USS is a new variant of the contextual bandits problem where the loss of an arm cannot be inferred from the observed feedback.
We propose an algorithm for the contextual USS problem and demonstrate that it has sub-linear regret.
- Score: 38.47252956961143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study Contextual Unsupervised Sequential Selection (USS), a
new variant of the stochastic contextual bandits problem where the loss of an
arm cannot be inferred from the observed feedback. In our setup, arms are
associated with fixed costs and are ordered, forming a cascade. In each round,
a context is presented, and the learner selects the arms sequentially until some
depth. The total cost incurred by stopping at an arm is the sum of fixed costs
of arms selected and the stochastic loss associated with the arm. The learner's
goal is to learn a decision rule that maps contexts to arms so as to minimize
the total expected loss. The problem is challenging because the setting is
unsupervised: the total loss cannot be estimated. Clearly,
learning is feasible only if the optimal arm can be inferred (explicitly or
implicitly) from the problem structure. We observe that learning is still
possible when the problem instance satisfies the so-called 'Contextual Weak
Dominance' (CWD) property. Under CWD, we propose an algorithm for the
contextual USS problem and demonstrate that it has sub-linear regret.
Experiments on synthetic and real datasets validate our algorithm.
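To make the cascade cost structure concrete, the following is a minimal Python sketch. The arm noise rates, costs, and the disagreement-rate proxy are assumptions made for illustration, not the paper's algorithm; the sketch only shows why disagreements between consecutive arms are observable even though the losses themselves are not, which is the quantity that CWD-style conditions exploit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative cascade: arm i is a noisy classifier whose accuracy
# improves with depth while its fixed query cost grows. All numbers
# below are assumptions for the sketch.
COSTS = [0.0, 0.1, 0.3]      # fixed cost of querying arm i
NOISE = [0.30, 0.15, 0.05]   # arm i flips the true label w.p. NOISE[i]

def query_cascade(true_label, depth):
    """Query arms 1..depth in order; return their predictions and the
    total fixed cost paid. The losses themselves stay unobserved."""
    preds = [int(true_label) ^ int(rng.random() < NOISE[i])
             for i in range(depth)]
    return preds, sum(COSTS[:depth])

def disagreement_rates(n_rounds=5000, depth=3):
    """Observable proxy used under weak-dominance-style conditions:
    how often do consecutive arms disagree? If arm i+1 rarely
    contradicts arm i, stopping at arm i loses little."""
    dis = np.zeros(depth - 1)
    for _ in range(n_rounds):
        y = int(rng.integers(0, 2))
        preds, _ = query_cascade(y, depth)
        for i in range(depth - 1):
            dis[i] += preds[i] != preds[i + 1]
    return dis / n_rounds

print(disagreement_rates())  # roughly [0.36, 0.18] for these noise rates
```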
Related papers
- Contextual Bandits with Arm Request Costs and Delays [19.263086804406786]
We introduce a novel extension of the contextual bandit problem, where new sets of arms can be requested with time delays and associated costs.
In this setting, the learner can select multiple arms from a decision set, with each selection taking one unit of time.
We design algorithms that can effectively select arms and determine the appropriate time to request new arms, thereby minimizing regret.
arXiv Detail & Related papers (2024-10-17T00:44:50Z)
- A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
arXiv Detail & Related papers (2024-09-08T12:19:12Z)
- Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs).
We study a model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose a robust contextual dueling bandit algorithm based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
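As a rough illustration of the uncertainty-weighting idea (the feature construction, weight rule, and optimizer below are assumptions for this sketch, not the paper's estimator), one can fit a logistic preference model while re-weighting duels by how well their feature direction is already covered:

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_logistic_mle(X, y, w, lr=0.5, steps=2000, lam=1.0):
    """Gradient ascent on the w-weighted, ridge-regularized
    log-likelihood of binary preference labels y given duel features X."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (w * (y - p)) - lam * theta
        theta += lr * grad / len(y)
    return theta

# Hypothetical data: X holds feature differences of the duels, y the
# (possibly adversarially flipped) preference labels.
X = rng.normal(size=(500, 5))
theta_star = np.array([1.0, -0.5, 0.3, 0.0, 0.8])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)

# Hypothetical uncertainty weights: an elliptical-norm bonus measures
# how well each duel's direction is covered; weights are capped below.
Sigma = np.eye(5) + X.T @ X
bonus = np.sqrt(np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X))
w = 1.0 / np.maximum(bonus, 0.1)

print(weighted_logistic_mle(X, y, w))  # should land roughly near theta_star
```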
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models [25.336599480692122]
We consider the regret minimization task in a dueling bandits problem with context information.
We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process.
Our experiments demonstrate its superiority over state-of-the-art algorithms for special cases of CoLST models.
arXiv Detail & Related papers (2022-02-09T17:44:19Z)
- Problem Dependent View on Structured Thresholding Bandit Problems [73.70176003598449]
We investigate the problem-dependent regime in the Thresholding Bandit problem (TBP).
The objective of the learner is to output, at the end of a sequential game, the set of arms whose means are above a given threshold.
We provide upper and lower bounds for the probability of error in both the concave and monotone settings.
arXiv Detail & Related papers (2021-06-18T15:01:01Z)
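For intuition, a naive uniform-allocation baseline for this objective is sketched below; the Gaussian rewards, budget split, and plain empirical-mean rule are assumptions, and the paper's contribution concerns problem-dependent bounds rather than this baseline.

```python
import numpy as np

rng = np.random.default_rng(2)

def threshold_bandit_uniform(means, threshold, budget):
    """Spend the budget uniformly across arms, then output the set of
    arms whose empirical means exceed the threshold (Gaussian rewards
    with unit variance assumed)."""
    K = len(means)
    pulls = budget // K
    emp = np.array([rng.normal(mu, 1.0, pulls).mean() for mu in means])
    return {i for i in range(K) if emp[i] > threshold}

print(threshold_bandit_uniform([0.2, 0.45, 0.55, 0.9], 0.5, 4000))
# Arms close to the threshold (here 0.45 and 0.55) dominate the error
# probability, which is what problem-dependent bounds quantify.
```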
- Online Model Selection: a Rested Bandit Formulation [49.69377391589057]
We introduce and analyze a best arm identification problem in the rested bandit setting.
We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game.
Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific structure of the problem to learn the unknown parameters of the expected loss function.
arXiv Detail & Related papers (2020-12-07T08:23:08Z)
- Thompson Sampling for Unsupervised Sequential Selection [3.5436187733613087]
We study a Thompson Sampling-based algorithm for the Unsupervised Sequential Selection (USS) problem.
The USS problem is a variant of the multi-armed bandits problem, where the loss of an arm cannot be inferred from the observed feedback.
We show that our Thompson Sampling-based algorithm for the USS problem achieves near-optimal regret and has better numerical performance than existing algorithms.
arXiv Detail & Related papers (2020-09-16T08:48:17Z)
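A Thompson-Sampling-flavored sketch of the idea follows; the Beta posteriors over consecutive-arm disagreement rates and the marginal-cost stopping test are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

K = 3
COSTS = [0.0, 0.1, 0.3]
# Beta posterior counts for the event "arms i and i+1 disagree".
alpha = np.ones(K - 1)
beta = np.ones(K - 1)

def choose_arm():
    """Sample disagreement rates from the posterior and stop at the
    first arm whose sampled disagreement with the next arm is below
    the marginal cost of querying one arm deeper."""
    theta = rng.beta(alpha, beta)
    for i in range(K - 1):
        if theta[i] <= COSTS[i + 1] - COSTS[i]:
            return i
    return K - 1

def update(preds):
    """Update the posteriors from the predictions of the queried arms."""
    for i in range(len(preds) - 1):
        if preds[i] != preds[i + 1]:
            alpha[i] += 1
        else:
            beta[i] += 1

for _ in range(100):
    arm = choose_arm()
    # Placeholder feedback from a hypothetical cascade of arms 0..arm.
    preds = [int(rng.random() < 0.5) for _ in range(arm + 1)]
    update(preds)
print(alpha, beta)
```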
- Contextual Blocking Bandits [35.235375147227124]
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards.
Playing an arm blocks it (across all contexts) for a fixed and known number of future time steps.
We propose a UCB-based variant of the full-information algorithm that guarantees an $\mathcal{O}(\log T)$ regret w.r.t. an $\alpha$-regret strategy in $T$ time steps, matching the $\Omega(\log T)$ lower bound.
arXiv Detail & Related papers (2020-03-06T20:34:42Z)
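A minimal, context-free UCB sketch with blocking is given below; the reward means, delays, and confidence-bonus form are assumptions for illustration, and the paper's algorithm additionally handles contexts and competes against an $\alpha$-regret benchmark.

```python
import numpy as np

rng = np.random.default_rng(4)

MEANS = [0.3, 0.5, 0.7]      # Bernoulli reward means (assumed)
DELAY = [2, 3, 4]            # arm i is blocked this many steps after play
K, T = len(MEANS), 2000

counts = np.zeros(K)
sums = np.zeros(K)
free_at = np.zeros(K)        # first step at which each arm is available

for t in range(1, T + 1):
    avail = [i for i in range(K) if free_at[i] <= t]
    if not avail:
        continue             # every arm is currently blocked; idle
    ucb = [np.inf if counts[i] == 0
           else sums[i] / counts[i] + np.sqrt(2 * np.log(t) / counts[i])
           for i in avail]
    i = avail[int(np.argmax(ucb))]
    reward = float(rng.random() < MEANS[i])
    counts[i] += 1
    sums[i] += reward
    free_at[i] = t + DELAY[i] + 1   # blocked for DELAY[i] future steps

print("average reward:", sums.sum() / T)
```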