Thompson Sampling for Linearly Constrained Bandits
- URL: http://arxiv.org/abs/2004.09258v2
- Date: Tue, 12 May 2020 18:34:47 GMT
- Title: Thompson Sampling for Linearly Constrained Bandits
- Authors: Vidit Saxena, Joseph E. Gonzalez, and Joakim Jaldén
- Abstract summary: We describe LinConTS, an algorithm for bandits that place a linear constraint on the probability of earning a reward in every round.
We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(log T) for the suboptimal arms.
- Score: 18.477962150017095
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We address multi-armed bandits (MAB) where the objective is to maximize the
cumulative reward under a probabilistic linear constraint. For a few real-world
instances of this problem, constrained extensions of the well-known Thompson
Sampling (TS) heuristic have recently been proposed. However, finite-time
analysis of constrained TS is challenging; as a result, only O(\sqrt{T}) bounds
on the cumulative reward loss (i.e., the regret) are available. In this paper,
we describe LinConTS, a TS-based algorithm for bandits that place a linear
constraint on the probability of earning a reward in every round. We show that
for LinConTS, the regret as well as the cumulative constraint violations are
upper bounded by O(\log T) for the suboptimal arms. We develop a proof
technique that relies on careful analysis of the dual problem and combine it
with recent theoretical work on unconstrained TS. Through numerical experiments
on two real-world datasets, we demonstrate that LinConTS outperforms an
asymptotically optimal upper confidence bound (UCB) scheme in terms of
simultaneously minimizing the regret and the violation.
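To make the algorithmic idea concrete, below is a minimal sketch of one round of a LinConTS-style step, assuming Bernoulli rewards with known per-arm values, Beta posteriors per arm, and a threshold tau on the per-round probability of earning a reward. The function name `linconts_round`, the `values` vector, and the infeasibility fallback are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def linconts_round(successes, failures, values, tau, rng):
    """One round of a LinConTS-style constrained Thompson sampling step.

    successes, failures: per-arm Beta posterior counts (numpy arrays).
    values: per-arm reward values, assumed known here.
    tau: minimum required per-round probability of earning a reward.
    """
    k = len(values)
    # Posterior sample of each arm's reward probability.
    theta = rng.beta(successes + 1, failures + 1)
    # LP over arm-selection distributions p:
    #   maximize  sum_i p_i * values_i * theta_i
    #   s.t.      sum_i p_i * theta_i >= tau,  sum_i p_i = 1,  p >= 0.
    res = linprog(
        c=-(values * theta),              # linprog minimizes, so negate
        A_ub=-theta.reshape(1, -1),       # -theta @ p <= -tau
        b_ub=np.array([-tau]),
        A_eq=np.ones((1, k)),
        b_eq=np.array([1.0]),
        bounds=[(0.0, 1.0)] * k,
        method="highs",
    )
    if res.success:
        p = np.clip(res.x, 0.0, None)
        p /= p.sum()
    else:
        # Sampled problem infeasible: fall back to the arm most likely to pay out.
        p = np.zeros(k)
        p[np.argmax(theta)] = 1.0
    return rng.choice(k, p=p)
```

After observing the played arm's Bernoulli outcome, its success or failure count is incremented, exactly as in standard Beta-Bernoulli Thompson sampling.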
Related papers
- Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints [8.784438985280094]
We study pure exploration problems in multi-armed bandits with unknown linear constraints.
First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints.
Second, we leverage the Lagrangian lower bound and the properties of convexity to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX.
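As a reminder of the general technique invoked here (a schematic only, not the paper's exact sample-complexity bound), a Lagrangian relaxation replaces a constrained problem with a dual lower bound:

```latex
% Generic Lagrangian relaxation (schematic; not the paper's exact
% sample-complexity bound). Weak duality gives the lower bound.
p^{\star} = \min_{x \in \mathcal{X}} f(x) \quad \text{s.t.} \quad g(x) \le 0,
\qquad L(x,\lambda) = f(x) + \lambda\, g(x),
\qquad d^{\star} = \max_{\lambda \ge 0} \min_{x \in \mathcal{X}} L(x,\lambda) \;\le\; p^{\star}.
```

Weak duality makes the relaxed quantity a tractable surrogate for the constrained lower bound that tracking-style algorithms can work against.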
arXiv Detail & Related papers (2024-10-24T15:26:14Z)
- Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits [21.09844002135398]
We show that Linear Thompson Sampling (LinTS) and a Bayesian Upper Confidence Bound extension (LinBUCB) preserve their original regret upper bound rates under approximate inference.
We also show that LinBUCB expedites the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate.
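For reference, one round of exact (non-approximate) LinTS is only a few lines; a minimal sketch with a Gaussian posterior, where the helper name and the scale parameter `v` are illustrative:

```python
import numpy as np

def lints_round(A, b, contexts, v, rng):
    """One exact LinTS round; minimal sketch (names illustrative).

    A: (d, d) regularized design matrix, lambda * I + sum of x x^T.
    b: (d,) accumulated reward-weighted contexts, sum of r_t * x_t.
    contexts: (num_arms, d) feature vectors available this round.
    v: posterior scale controlling the amount of exploration.
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                    # ridge estimate of theta*
    # Draw a posterior sample theta ~ N(theta_hat, v^2 * A^{-1}).
    theta_tilde = rng.multivariate_normal(theta_hat, (v ** 2) * A_inv)
    # Act greedily with respect to the sampled parameter.
    return int(np.argmax(contexts @ theta_tilde))

# Update after observing reward r for the chosen context x:
#   A += np.outer(x, x);  b += r * x
```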
arXiv Detail & Related papers (2024-06-20T07:45:38Z)
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm suffers only $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z)
- Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits [81.45853204922795]
Thompson sampling has been shown to achieve problem-dependent lower bounds in several reward models.
We consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters.
We find that TS with the Jeffreys and reference priors can achieve the lower bound if one uses a truncation procedure.
arXiv Detail & Related papers (2023-02-03T04:47:14Z)
- Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To choose an input query, we only need to train an NN and then choose the query by maximizing the trained NN.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
arXiv Detail & Related papers (2022-10-13T09:01:58Z)
- On Kernelized Multi-Armed Bandits with Constraints [16.102401271318012]
We study a bandit problem with a general unknown reward function and a general unknown constraint function.
We propose a general framework for both algorithm design and performance analysis.
We demonstrate the superior performance of our proposed algorithms via numerical experiments.
arXiv Detail & Related papers (2022-03-29T14:02:03Z)
- A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic-pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- On Frequentist Regret of Linear Thompson Sampling [8.071506311915398]
This paper studies the linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in $\mathbb{R}^d$ and receives noisy rewards.
The objective is to minimize regret, the difference between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action, over a sequence of $T$ decisions.
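Written out from that verbal definition, the cumulative expected regret is

```latex
% Cumulative expected regret over T rounds, from the verbal definition above.
R(T) = \sum_{t=1}^{T} \Big( \max_{x \in \mathcal{X}_t} x^{\top} \theta^{\ast} - x_t^{\top} \theta^{\ast} \Big)
```

where $\mathcal{X}_t$ is the round-$t$ action set, $\theta^{\ast}$ the unknown parameter, and $x_t$ the action played.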
arXiv Detail & Related papers (2020-06-11T20:19:41Z)
- A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
We introduce a general analysis framework and a family of algorithms for the linear bandit problem.
Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG), that reduces the overexploration problem in OFUL.
In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)