Thompson Sampling for Unsupervised Sequential Selection
- URL: http://arxiv.org/abs/2009.07554v1
- Date: Wed, 16 Sep 2020 08:48:17 GMT
- Title: Thompson Sampling for Unsupervised Sequential Selection
- Authors: Arun Verma, Manjesh K. Hanawal, Nandyala Hemachandra
- Abstract summary: We study a Thompson Sampling-based algorithm for the Unsupervised Sequential Selection (USS) problem.
The USS problem is a variant of the multi-armed bandits problem, where the loss of an arm cannot be inferred from the observed feedback.
We show that our Thompson Sampling-based algorithm for the USS problem achieves near-optimal regret and has better numerical performance than existing algorithms.
- Score: 3.5436187733613087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thompson Sampling has generated significant interest due to its better
empirical performance than upper-confidence-bound-based algorithms. In this
paper, we study a Thompson Sampling-based algorithm for the Unsupervised
Sequential Selection (USS) problem. The USS problem is a variant of the
stochastic multi-armed bandits problem, where the loss of an arm cannot be
inferred from
the observed feedback. In the USS setup, arms are associated with fixed costs
and are ordered, forming a cascade. In each round, the learner selects an arm
and observes the feedback from arms up to the selected arm. The learner's goal
is to find the arm that minimizes the expected total loss. The total loss is
the sum of the cost incurred for selecting the arm and the stochastic loss
associated with the selected arm. The problem is challenging because, without
knowing the mean loss, one cannot compute the total loss for the selected arm.
Clearly, learning is feasible only if the optimal arm can be inferred from the
problem structure. As shown in the prior work, learning is possible when the
problem instance satisfies the so-called 'Weak Dominance' (WD) property. Under
WD, we show that our Thompson Sampling-based algorithm for the USS problem
achieves near-optimal regret and has better numerical performance than existing
algorithms.
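To make the cascade feedback structure concrete, below is a minimal sketch of a Thompson Sampling-style learner on a toy USS instance. It is not the paper's exact algorithm: the three-arm instance, the Beta posteriors over consecutive-arm disagreement rates, and the marginal-cost stopping rule are illustrative assumptions; the paper derives its actual selection rule from the WD property.

```python
import numpy as np

# Toy USS instance (hypothetical, not from the paper): K arms in a cascade,
# each with a known cost; selecting arm i reveals the outputs of arms 1..i.
rng = np.random.default_rng(0)

K = 3
true_error = np.array([0.30, 0.15, 0.05])  # per-arm loss rates, unknown to the learner
cost = np.array([0.00, 0.10, 0.22])        # known costs, increasing along the cascade
T = 5000

# The per-arm loss is unobservable, but disagreements between consecutive
# arms ARE observable, so keep a Beta posterior on each disagreement rate.
alpha = np.ones(K - 1)
beta = np.ones(K - 1)

for t in range(T):
    # Thompson step: sample each consecutive disagreement rate from its posterior.
    p_sample = rng.beta(alpha, beta)

    # Simplified WD-style stopping rule (an assumption for this sketch):
    # advance up the cascade while the sampled disagreement with the next
    # arm exceeds that arm's marginal cost.
    arm = 0
    while arm < K - 1 and p_sample[arm] >= cost[arm + 1] - cost[arm]:
        arm += 1

    # Environment: binary outputs, so two arms disagree exactly when one
    # of them is wrong and the other is right (toy model).
    outputs = (rng.random(K) > true_error).astype(int)

    # Cascade feedback: outputs of arms 1..arm are revealed, so every
    # consecutive pair among them yields one disagreement observation.
    for i in range(arm):
        d = int(outputs[i] != outputs[i + 1])
        alpha[i] += d
        beta[i] += 1 - d
```

The point the sketch preserves is that the learner never observes an arm's loss directly; it can only learn from the costs it pays and the agreements among the outputs it has already paid for.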
Related papers
- Contextual Bandits with Arm Request Costs and Delays [19.263086804406786]
We introduce a novel extension of the contextual bandit problem, where new sets of arms can be requested with time delays and associated costs.
In this setting, the learner can select multiple arms from a decision set, with each selection taking one unit of time.
We design algorithms that can effectively select arms and determine the appropriate time to request new arms, thereby minimizing their regret.
arXiv Detail & Related papers (2024-10-17T00:44:50Z) - A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
arXiv Detail & Related papers (2024-09-08T12:19:12Z) - Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z) - Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget [6.22018632187078]
We introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget.
The goal is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs with a given learning budget $N$.
We provide a theoretical upper bound on the probability of mis-identifying the support of the best mixed arm and show that it decays exponentially in the budget $N$.
arXiv Detail & Related papers (2024-05-23T22:35:11Z) - Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T\log(|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z) - Online Model Selection: a Rested Bandit Formulation [49.69377391589057]
We introduce and analyze a best arm identification problem in the rested bandit setting.
We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game.
Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific structure of the problem to learn the unknown parameters of the expected loss function.
arXiv Detail & Related papers (2020-12-07T08:23:08Z) - Online Algorithm for Unsupervised Sequential Selection with Contextual
Information [38.47252956961143]
We study Contextual Unsupervised Sequential Selection (USS), a new variant of the contextual bandits problem where the loss of an arm cannot be inferred from the observed feedback.
We propose an algorithm for the contextual USS problem and demonstrate that it has sub-linear regret.
arXiv Detail & Related papers (2020-10-23T12:32:21Z) - The Price of Incentivizing Exploration: A Characterization via Thompson
Sampling and Sample Complexity [83.81297078039836]
We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents.
We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility.
arXiv Detail & Related papers (2020-02-03T04:58:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.