Thompson Sampling for Linearly Constrained Bandits
- URL: http://arxiv.org/abs/2004.09258v2
- Date: Tue, 12 May 2020 18:34:47 GMT
- Title: Thompson Sampling for Linearly Constrained Bandits
- Authors: Vidit Saxena, Joseph E. Gonzalez, and Joakim Jaldén
- Abstract summary: We describe LinConTS, an algorithm for bandits that place a linear constraint on the probability of earning a reward in every round.
We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(log T) for the suboptimal arms.
- Score: 18.477962150017095
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We address multi-armed bandits (MAB) where the objective is to maximize the
cumulative reward under a probabilistic linear constraint. For a few real-world
instances of this problem, constrained extensions of the well-known Thompson
Sampling (TS) heuristic have recently been proposed. However, finite-time
analysis of constrained TS is challenging; as a result, only O(\sqrt{T}) bounds
on the cumulative reward loss (i.e., the regret) are available. In this paper,
we describe LinConTS, a TS-based algorithm for bandits that place a linear
constraint on the probability of earning a reward in every round. We show that
for LinConTS, the regret as well as the cumulative constraint violations are
upper bounded by O(\log T) for the suboptimal arms. We develop a proof
technique that relies on careful analysis of the dual problem and combine it
with recent theoretical work on unconstrained TS. Through numerical experiments
on two real-world datasets, we demonstrate that LinConTS outperforms an
asymptotically optimal upper confidence bound (UCB) scheme in terms of
simultaneously minimizing the regret and the violation.
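To make the algorithmic idea concrete, below is a minimal sketch of one round of a LinConTS-style step, assuming Bernoulli rewards with known per-arm values, Beta posteriors per arm, and a threshold tau on the per-round probability of earning a reward. The function name `linconts_round`, the `values` vector, and the infeasibility fallback are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def linconts_round(successes, failures, values, tau, rng):
    """One round of a LinConTS-style constrained Thompson sampling step.

    successes, failures: per-arm Beta posterior counts (numpy arrays).
    values: per-arm reward values, assumed known here.
    tau: minimum required per-round probability of earning a reward.
    """
    k = len(values)
    # Posterior sample of each arm's reward probability.
    theta = rng.beta(successes + 1, failures + 1)
    # LP over arm-selection distributions p:
    #   maximize  sum_i p_i * values_i * theta_i
    #   s.t.      sum_i p_i * theta_i >= tau,  sum_i p_i = 1,  p >= 0.
    res = linprog(
        c=-(values * theta),              # linprog minimizes, so negate
        A_ub=-theta.reshape(1, -1),       # -theta @ p <= -tau
        b_ub=np.array([-tau]),
        A_eq=np.ones((1, k)),
        b_eq=np.array([1.0]),
        bounds=[(0.0, 1.0)] * k,
        method="highs",
    )
    if res.success:
        p = np.clip(res.x, 0.0, None)
        p /= p.sum()
    else:
        # Sampled problem infeasible: fall back to the arm most likely to pay out.
        p = np.zeros(k)
        p[np.argmax(theta)] = 1.0
    return rng.choice(k, p=p)
```

After observing the played arm's Bernoulli outcome, its success or failure count is incremented, exactly as in standard Beta-Bernoulli Thompson sampling.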
Related papers
- Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints [8.784438985280094]
We study pure exploration problems in multi-armed bandits with unknown linear constraints.
First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints.
Second, we leverage the Lagrangian lower bound and the properties of convexity to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX.
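As a reminder of the general technique invoked here (a schematic only, not the paper's exact sample-complexity bound), a Lagrangian relaxation replaces a constrained problem with a dual lower bound:

```latex
% Generic Lagrangian relaxation (schematic; not the paper's exact
% sample-complexity bound). Weak duality gives the lower bound.
p^{\star} = \min_{x \in \mathcal{X}} f(x) \quad \text{s.t.} \quad g(x) \le 0,
\qquad L(x,\lambda) = f(x) + \lambda\, g(x),
\qquad d^{\star} = \max_{\lambda \ge 0} \min_{x \in \mathcal{X}} L(x,\lambda) \;\le\; p^{\star}.
```

Weak duality makes the relaxed quantity a tractable surrogate for the constrained lower bound that tracking-style algorithms can work against.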
arXiv Detail & Related papers (2024-10-24T15:26:14Z)
- Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits [21.09844002135398]
We show that Linear Thompson Sampling (LinTS) and a Bayesian Upper Confidence Bound extension (LinBUCB) preserve their original regret upper bound rates under approximate inference.
We also show that LinBUCB expedites the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate.
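For reference, one round of exact (non-approximate) LinTS is only a few lines; a minimal sketch with a Gaussian posterior, where the helper name and the scale parameter `v` are illustrative:

```python
import numpy as np

def lints_round(A, b, contexts, v, rng):
    """One exact LinTS round; minimal sketch (names illustrative).

    A: (d, d) regularized design matrix, lambda * I + sum of x x^T.
    b: (d,) accumulated reward-weighted contexts, sum of r_t * x_t.
    contexts: (num_arms, d) feature vectors available this round.
    v: posterior scale controlling the amount of exploration.
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                    # ridge estimate of theta*
    # Draw a posterior sample theta ~ N(theta_hat, v^2 * A^{-1}).
    theta_tilde = rng.multivariate_normal(theta_hat, (v ** 2) * A_inv)
    # Act greedily with respect to the sampled parameter.
    return int(np.argmax(contexts @ theta_tilde))

# Update after observing reward r for the chosen context x:
#   A += np.outer(x, x);  b += r * x
```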
arXiv Detail & Related papers (2024-06-20T07:45:38Z)
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm suffers only $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z)
- Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits [81.45853204922795]
Thompson sampling has been shown to achieve problem-dependent lower bounds in several reward models.
We consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters.
We find that TS with the Jeffreys and reference priors can achieve the lower bound if one uses a truncation procedure.
arXiv Detail & Related papers (2023-02-03T04:47:14Z)
- Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To choose an input query, we only need to train an NN and then choose the query by maximizing the trained NN.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
arXiv Detail & Related papers (2022-10-13T09:01:58Z)
- On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm design and performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z)
- A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic-pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- On Frequentist Regret of Linear Thompson Sampling [8.071506311915398]
This paper studies the linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in $\mathbb{R}^d$ and receives noisy rewards.
The objective is to minimize regret, the difference between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action, over a sequence of $T$ decisions.
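Written out from that verbal definition, the cumulative expected regret is

```latex
% Cumulative expected regret over T rounds, from the verbal definition above.
R(T) = \sum_{t=1}^{T} \Big( \max_{x \in \mathcal{X}_t} x^{\top} \theta^{\ast} - x_t^{\top} \theta^{\ast} \Big)
```

where $\mathcal{X}_t$ is the round-$t$ action set, $\theta^{\ast}$ the unknown parameter, and $x_t$ the action played.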
arXiv Detail & Related papers (2020-06-11T20:19:41Z)
- A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
We introduce a general analysis framework and a family of algorithms for the linear bandit problem.
Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG), that reduces the overexploration problem in OFUL.
In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)